A Framework for Analyzing Architecture-Level
Fault Tolerance Behavior in Applications
by
Harshad Sane
B.S., University of Pune, India 2004
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Electrical and Computer Engineering
2008
This thesis entitled: A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications, written by Harshad Sane
has been approved for the Department of Electrical and Computer Engineering
Professor Daniel A. Connors
Professor Manish Vachharajani
Professor Li Shang
Date
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.
Sane, Harshad (M.S., Computer Engineering)
A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications
Thesis directed by Professor Daniel A. Connors, Ph.D.
Radiation-induced transient faults, also known as single-event upsets, are a major
concern in the design and operation of modern computer systems. Transient errors first
impact the circuit and logic levels of a processor, and may propagate to the micro-
architecture and architecture state of a processor. When undetected, transient errors
in architecture state can lead to incorrect and undefined application behavior.
Detailed simulation is a vital component of the design process of modern pro-
cessors and exploration of new design concepts. However, high-level architectural simulators typically run 100-1000x slower than native execution, making detailed simulation of most programs prohibitively long. Due to the deficiencies in simulation technology, current architecture-
level transient fault studies are primarily based on sampling techniques. Current pro-
gram fault behavior analysis uses architecture-level injection of random bits selected
over a time line. Such injection methods allow only a limited number of injections per
unit time, a high percentage of which may not expose the true fault susceptibility nature
of a program.
This thesis introduces an accurate and fast fault-injection framework for studying
inherent code execution properties of a program that correlate to different levels of fault
susceptibility. The framework utilizes a three step approach consisting of trace logging,
fault injection and replay execution to emulate the effect of a transient fault in the
architectural registers of a running application. Finally by correlating the injection
analysis results to dependence graph patterns across sets of applications, an analysis
methodology is constructed to accurately predict the fault tolerance of an application
without performing any error injections.
Dedication
I dedicate this thesis to my family for their unconditional support.
Acknowledgements
First, I would like to thank my adviser, Professor Dan Connors, who guided me through my academic endeavor at CU. I would like to thank him for all the knowledge and encouragement he bestowed upon me. This work has been possible because of his guidance and prompt feedback.
I would like to thank all the members of the DRACO research group for their
insight and ideas.
I would like to thank my friends who made sure that I enjoyed life along with my career. Most importantly, I would like to thank my family, who have supported me unconditionally throughout my life.
Contents
Chapter
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 4
2.1 Single Event Phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Architecture-level Fault Analysis . . . . . . . . . . . . . . . . . . . . . . 6
3 Motivation 8
3.1 Overview of current injection methodologies . . . . . . . . . . . . . . . . 8
3.2 Fault penetration and point of injection . . . . . . . . . . . . . . . . . . 9
3.3 Natural fault resilience and ineffective injections . . . . . . . . . . . . . 11
4 Fault Emulation Framework and Results 14
4.1 Overview and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 The TEFIS Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 Execution Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.2 Fault Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.3 Fault Emulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Trace Emulation Experimental Results and Analysis . . . . . . . . . . . 21
4.3.1 Fault tolerance categories . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Accuracy of emulations . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.3 Average Execution Time of Fault Emulations Against Full Injections 24
5 Analysis of Fault Tolerance Program Behavior 26
5.1 Source Code Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1.1 Algorithm Level Effects . . . . . . . . . . . . . . . . . . . . . . . 26
5.1.2 Dynamic Source Code Behavior . . . . . . . . . . . . . . . . . . . 28
5.2 Dynamic Program Trace Behavior . . . . . . . . . . . . . . . . . . . . . 29
5.2.1 Inter-Procedural Fault Tolerance . . . . . . . . . . . . . . . . . . 31
5.3 Dynamic Dependence Graph Representation . . . . . . . . . . . . . . . . 33
5.4 Estimating Program Fault Tolerance . . . . . . . . . . . . . . . . . . . . 36
6 Future Work 38
7 Conclusion 40
Bibliography 41
Tables
Table
3.1 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Fault tolerance characteristics of logical operations. . . . . . . . . . . . . 12
Figures
Figure
2.1 A neutron strike. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Feature size vs. soft error rate. . . . . . . . . . . . . . . . . . . . . . . . 6
3.1 Current random injection methodologies. . . . . . . . . . . . . . . . . . 9
3.2 Fault penetration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Fault injection distribution in time. . . . . . . . . . . . . . . . . . . . . . 11
3.4 A highly fault-tolerant case from 186.crafty with 80.6% correct results of
emulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1 Fractional emulation captures the effect of fractional execution with traces
against full injections in the entire application. The figure shows the ex-
clusion of build-up and monitoring time. . . . . . . . . . . . . . . . . . . 16
4.2 Context dump. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Framework overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Fault emulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Result categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.6 Emulation accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.7 A comparison of execution time. . . . . . . . . . . . . . . . . . . . . . . 25
5.1 Fault tolerance analysis of sorting routines (a)Heapsort (b)Quicksort. . . 27
5.2 A trace from 164.gzip with 91.4% incorrect results of emulation. . . . . 29
5.3 A trace from 300.twolf with 100% segmentation fault results of emulation. 30
5.4 Fault tolerance correlation with program counter similarity. . . . . . . . 31
5.5 Fault susceptibility correlation calculated using similar program code
points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.6 Inter-procedural fault tolerance. . . . . . . . . . . . . . . . . . . . . . . . 33
5.7 Example: Dependency graph. . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 Dependence graph similarity correlation. . . . . . . . . . . . . . . . . . . 35
5.9 Dependence graph similarity correlation - All benchmarks. . . . . . . . . 35
5.10 Fault tolerance prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 1
Introduction
Scaling trends in technology that lead to faster and smaller transistors, lower
voltage levels and reduced noise margins also increase the susceptibility of a circuit to
transient faults. Shielding systems in hardware from radiation, cosmic rays and cross
talk is difficult from a high-speed design perspective and is costly in terms of active power
consumption. Transient errors first impact the circuit and logic levels of a processor,
and can propagate to the micro-architecture and architecture state. Architecture state
errors lead to invalid and unpredictable software behavior. As many program phases
are more tolerant of some single-bit architecture errors, software shielding becomes an
exciting solution due to its low cost and flexibility. Many software-based fault tolerant
techniques [24, 23, 10] have been proposed to balance performance with error detection
and recovery.
Current studies involving the observation of effects of transient faults utilize a
limited number of random fault injections in which micro-architecture (pipeline reg-
isters, intermediate logic) or architecture (register, memory) state is modified during
simulated execution. As these error injection campaigns involve emulating machine
execution, each injection run includes substantial experimental time leading up to the
point of injection, and the remaining execution of the application to determine program
correctness. Collectively, the excessive time for simulated injections limits the points of the program execution that can be studied, thus reducing the significance of understanding the fault tolerance behavior in applications.
There are a number of ways to improve the accuracy and execution time of fault analysis over traditional fault injection systems. First, as applications are characterized by repeating phases [5], there are opportunities to reduce the number of fault injections by studying representative phases of execution. Furthermore, random injections often do not expose relevant program behavior, as significant portions of program execution involve dead code [6] and value locality [16]. There is substantial potential to model the fault behavior of code sequences by correlating the results of fault injections to dynamic code regions. In this way, a fault analysis modeling framework can be constructed to anticipate the fault susceptibility of an application based on the execution profile of code regions, without requiring fault injections.
1.1 Contributions
This thesis presents an experimental study of the current architecture-level injection techniques for evaluating fault tolerance. Based on this study, the first half of the thesis motivates the need for and design of a new fault injection methodology. The framework is based on trace logging and performing all possible architecture faults of the trace through an injection emulation system (TEFIS). An analysis of the approach in order to improve the accuracy and timeliness of the system is examined. The next section focuses on correlating the properties of a program with inherent fault tolerance characteristics. Based on this correlation, an experimental model is constructed to predict the fault tolerance of a program without any injections. This model provides a baseline for making fault tolerance predictions of applications. The contributions of this thesis are the following:
(1) Development of a fast and accurate fault modeling framework: A new methodol-
ogy for evaluating the transient fault tolerance of program regions is presented.
The framework deploys re-play execution of selected program traces that pro-
vide the same accurate results of full-scale fault injection in a fraction of the
experimental evaluation time.
(2) Demonstration of correlation between fault tolerance and program structure:
The fundamental code property of the dynamic dependence graph of archi-
tecture state is analyzed to expose patterns exhibiting various levels of fault
susceptibility. Case studies are examined to reveal relations between source
code structures and fault tolerance behavior.
(3) Fault tolerance prediction based on studied program behavior : Estimation of the fault tolerance of an application by constructing an analytical model that accurately assigns fault tolerance to code regions.
The following sections elaborate on each of these thesis contributions. The thesis con-
cludes by proposing possible enhancements to the framework and scope for future work
with this system.
Chapter 2
Background
2.1 Single Event Phenomenon
Radiation effects on processors are a major concern for architects as transistor features shrink. Among these effects, bit flips resulting from ionization by neutron strikes from cosmic rays and alpha particles are considered a critical source of upsets owing to their random occurrence. These effects, called Single Event Upsets (SEU), constitute a serious threat to the reliability of digital equipment built on advanced circuits.
Single event phenomena can be classified into three effects:
I Single event upset (soft error)
II Single event latch up (soft or hard error)
III Single event burnout (hard failure)
A single event upset is defined by NASA as "radiation induced errors in microelectronic circuits when charged particles lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs" [2]. These electron-hole pairs generate charge as they combine, and if this charge is greater than the critical charge of the device, it results in a change of state. Transient faults fall under this category of SEUs. Figure 2.1 is an example of a neutron strike on a transistor.
Figure 2.1: A neutron strike.
Transient faults are emerging as a major concern for architects designing reliable
computer systems [3, 17]. Trends in silicon process technology point to a bleak future in terms of fault susceptibility. While the future error rate of a single
transistor is expected to stay relatively constant [11, 13], the number of transistors per
chip continues to increase at an exponential rate. As a result, the overall error rate for
processors is expected to increase dramatically, making fault tolerance as important a
design characteristic as performance, power consumption, and temperature dissipation.
Figure 2.2 shows the effect of the reduction in feature size over the coming years. The degradation rate is about 8% per bit per generation [4] and follows the curve shown in the figure.
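The scaling argument above can be sketched numerically. The model below is illustrative only and is not from the thesis: it assumes the per-bit soft error rate stays roughly constant while on-chip state doubles each process generation, so the chip-level rate grows exponentially. All numeric values are assumptions chosen for illustration.

```python
# Illustrative model: chip-level soft error rate (SER) when the per-bit
# rate stays roughly constant but the amount of on-chip state doubles
# each process generation. All numbers below are assumed.

def chip_ser(per_bit_fit, bits):
    """Total SER in FIT (failures per 10^9 device-hours)."""
    return per_bit_fit * bits

per_bit_fit = 1e-4        # assumed constant per-bit rate (FIT)
bits = 1_000_000          # assumed on-chip state bits in generation 0

for gen in range(4):
    print(f"gen {gen}: {chip_ser(per_bit_fit, bits):.1f} FIT")
    bits *= 2             # state doubles with each generation
```

Under these assumptions the chip-level rate doubles every generation even with a flat per-bit rate, which is the trend the text describes.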
Hardware designs can be customized for fault tolerant execution with redundant
resources such as latches or extended pipelines. Providing fault tolerance may require
the addition of hundreds of thousands of delay latches and 20-30% logic to an existing
processor [28]. Other more specialized approaches create even more sophisticated sys-
tems requiring both hardware and software integration [1, 31]. While these approaches
work well in their specific scientific computing domain, the general purpose design field
must adapt to the need for fault tolerance in fundamentally different ways. As design
cycle time is critical, many chip designers propose implementing redundancy-based fault tolerance using existing multi-core and multi-threaded processor extensions [9, 18]. The driving motivation is to leverage the engineering investment in multi-context processors to provide fault tolerance.
Figure 2.2: Feature size vs. soft error rate.
2.2 Architecture-level Fault Analysis
Most recent architecture research is focused on using performance models to pro-
vide Architecture Vulnerability Factor (AVF) estimates of processor reliability rather
than deploying detailed fault injection into hardware RTL models. AVF is defined as
the probability that a fault in that particular structure will result in an error in the final
output of a program [19]. A structure’s error rate is the product of its raw error rate, as
determined by process and circuit technology, and the AVF. Processor designers can use
AVF analysis to determine the processor structures in probabilistic need of protection
(e.g., structures with high AVF are likely to be protected). Some structures, such as
the branch predictor, have no effect on whether an error will propagate to the output of
the program. In contrast, other structures are on the opposite end of the spectrum such
as the instruction issue window, load-store queue, and re-order buffer. The majority of
hardware structures fall in the middle of the two extremes.
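The AVF relation described above can be written as a small sketch: a structure's contribution to the chip error rate is its raw circuit-level rate scaled by its AVF, summed over structures. The structure names and all numbers below are hypothetical, chosen only to mirror the examples in the text (a branch predictor with AVF near 0, issue-window-like structures near the high end).

```python
# Sketch of the AVF error-rate relation: effective rate = raw rate * AVF.
# Structures and numbers are hypothetical illustrations.

structures = {
    # name: (raw_rate_FIT, AVF)
    "branch_predictor": (50.0, 0.0),   # errors never reach program output
    "issue_window":     (40.0, 0.9),   # nearly every bit matters
    "reorder_buffer":   (60.0, 0.7),   # somewhere in between
}

def effective_rate(raw_fit, avf):
    """A structure's error rate is its raw rate scaled by its AVF."""
    return raw_fit * avf

chip_rate = sum(effective_rate(raw, avf) for raw, avf in structures.values())
print(f"chip-level error rate: {chip_rate:.1f} FIT")
```

The branch predictor contributes nothing despite a nonzero raw rate, which is why designers would prioritize protecting the high-AVF structures instead.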
While AVF analysis provides support for investigating new fault tolerant architec-
ture techniques, program execution characteristics are largely missing from the determi-
nation of periods of software error susceptibility. A software-centric view makes this key
insight: although faults occur at the hardware level, the only faults which matter are the
faults which affect software correctness. By changing the boundaries of output compari-
son to software, a software-centric model shifts the focus from ensuring correct hardware
execution to ensuring correct software execution. As a result, only faults which affect
correctness are detected. Benign faults are safely ignored. A software-centric system
with both detection and recovery, will not need to invoke the recovery mechanism for
faults which do not affect correctness. The primary problem with AVF is that software
periods of vulnerability substantially differ from micro-architecture periods of vulner-
ability. As research trends dictate, finding ways to selectively enable transient fault
tolerant mechanisms, run-time and off-line experimental techniques must be guided
equally by program behavior and hardware. As such, it is important to determine and
predict when program susceptibility and hardware susceptibility differ.
Chapter 3
Motivation
3.1 Overview of current injection methodologies
A major problem in the development of fault-tolerant systems is the accurate
determination of the dependability properties of the system. Unlike performance, which
can be evaluated through the use of benchmark programs, the degree of fault tolerance
and reliability of a system cannot be evaluated in such a manner. This is because we
do not often have the luxury of allowing systems to run for a very long time to see their
behavior under fault effects. The generally preferred solution to this problem is to inject
the effects of faults in a simulation model or a prototype implementation, and to observe
the behavior of the system under the injected faults. Fault injection in a simulation is
very flexible but far too time consuming. On the other hand, it is much more difficult
to inject accurate (i.e. realistic) faults into a prototype, though their effect is readily
observable.
Figure 3.1 shows the current methods of fault injection, in which a random injection procedure has been adopted. The graph shows the number of injections per benchmark used in recent papers that adopted a random fault injection methodology. Table 3.1 lists the references for each of the published articles numbered in Figure 3.1.
Figure 3.1: Current random injection methodologies.
Number  Short Description                                              Reference
(1)     Soft-error detection through software TFT techniques (99)      [22]
(2)     Y-Branches (03)                                                [29]
(3)     Characterizing TF effects on processor pipeline (04)           [30]
(4)     Configurable TF detection via dynamic binary translation (06)  [24]
(5)     SymPLFIED (08)                                                 [20]
(6)     Using PLR to exploit multi cores for TFT (07)                  [27]
Table 3.1: References
A recent paper focusing on instruction level error derating adopted an interval injection methodology [7]. The injection campaign includes 100 uniformly distributed points of injection in trace lengths of 100 instructions, in 32-bit as well as 64-bit registers, resulting in up to 224 injections per instruction. This method shows a representative set of experiments without having to simulate the entire benchmark.
3.2 Fault penetration and point of injection
Fault injections can be performed at geometric layout, circuit, gate or block level models. The block level model is a functional view defining the data and control paths of the application. A fault at the logic-gate level can pass through several levels of masking, such as electrical, latching-window and logical masking, before it can affect the behavior of an application [26]. From a user's point of view, it only matters whether a transient fault causes undesirable effects in the application. Hence our study can utilize fault injections at the block level as long as they emulate the same effect as the propagation of a hardware fault. The process of injecting faults into the architectural registers captures the above notion, as shown in Figure 3.2, although logical masking would still persist between the architectural and application layers.
Figure 3.2: Fault penetration.
Current injection techniques corrupt a single bit at 1000 random execution points [25, 27]. As discussed previously, such tests do not regard program behavior and have substantial variation. Figure 3.3 demonstrates the cumulative time for 1000 random fault injections. The injection campaign time is sorted from the longest running injections to the shortest. Some of the runs take only a matter of seconds, while others take several minutes before it can be determined whether the program's behavior was changed. Based on Figure 3.3, performing statistically significant fault injection using random or interval-based schemes would require substantial computational effort. Clearly, fault injections into program state must be strategically guided to gather fault outcomes for only certain regions of interest.
Figure 3.3: Fault injection distribution in time.
3.3 Natural fault resilience and ineffective injections
As stated above, although electrical and latching-window masking is covered by injecting faults at the architectural level, logical masking of faults is overlooked in random injection methods. Examples of logical masking effects include logical operations, conditional operations, overwriting faulted values before use, binary return values, dynamically dead instructions [6], silent stores [15], etc. This is why a large number of random injections do not expose the fault susceptibility of a program.
For example, Table 3.2 shows the probable levels of tolerance of logical operations at the bit level. The table compares the logical operation of a register either with itself or with another operand. The values presented in this table can be derived from the operations' truth tables. From the table it can be seen that an XOR operation of a register on itself has 100% fault tolerance, since it simply clears the register. Hence any bit perturbation before this operation would have no effect whatsoever.
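The masking argument can be checked exhaustively for small cases. The sketch below is written for this text and is not part of the thesis framework: xor-ing a register with itself always yields 0, so every single-bit flip in that source register is masked; for AND with an independent operand, a flip is masked exactly where the other operand has a 0 bit, which averages about 50% of positions.

```python
# Exhaustive check of the logical-masking claims in Table 3.2 for
# single-bit flips of a 32-bit register. Illustrative only.

def flips_masked_by_xor_self(value, width=32):
    """Count single-bit flips of `value` that leave (v ^ v) unchanged."""
    masked = 0
    for bit in range(width):
        v = value ^ (1 << bit)          # inject the fault
        if v ^ v == value ^ value:      # both sides are 0: fully masked
            masked += 1
    return masked

def flips_masked_by_and(value, other, width=32):
    """Count single-bit flips of `value` that leave (v & other) unchanged."""
    baseline = value & other
    return sum(1 for bit in range(width)
               if (value ^ (1 << bit)) & other == baseline)

print(flips_masked_by_xor_self(0xDEADBEEF))          # 32: 100% tolerant
print(flips_masked_by_and(0x12345678, 0x0000FFFF))   # 16: flips in the upper half are masked
```

The AND case shows why the table's 50% figure is an average: the actual fraction masked equals the fraction of 0 bits in the other operand.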
Operation           Other Operand   Itself
AND, OR, NAND, NOR  50%             0%
XOR, XNOR           0%              100%
NOT                 0%              -

Table 3.2: Fault tolerance characteristics of logical operations.

Figure 3.4 is an example of logical masking obtained from a trace of 186.crafty. The figure shows the source code of a function from the benchmark along with its dynamic block execution. It can be seen from the source code that all this function does is return a binary value (true or false) based on the movement of a white or black piece in a game of chess. The analysis shows that with such a dependence flow, the effect of flipping bits would hardly impact the return values.
Random methods of fault injection impose a limitation on the number of injections per unit time, and a high percentage of those injections may, as in the above example, expose little susceptibility. These limitations motivate the development of a system which is not only orders of magnitude faster but also as accurate as full injections. In addition, it should be able to capture the properties of the program that affect its fault susceptibility.
TEFIS (Trace Emulation Fault Injecting System) is a powerful technique for emulating hardware faults in software. It uses PIN, a binary instrumentation tool, for tracing and emulation, which form the two major portions of the framework. Traces are captured along with their state information from a running executable based on various user-defined parameters. This is followed by a rigorous fault injection procedure and replay execution of the faulted trace, but only up to the length of the trace. This approach provides the desired flexibility and, at the same time, allows execution of many experimental runs in a relatively short period of time.
(a) Source code (b) Dynamic control flow
Figure 3.4: A highly fault-tolerant case from 186.crafty with 80.6% correct results of emulation.
Chapter 4
Fault Emulation Framework and Results
4.1 Overview and Goals
Software-implemented fault injection methodologies can be broadly classified into compile-time and run-time injections. A run-time fault injection system would truly emulate the effect of a real transient fault occurrence. Current methods of run-time software fault injection are either time based or interrupt driven. These are program-level fault injection techniques which are generally unguided. Faults are injected randomly, at intervals, or within phases of the application. Phase-based injection would exploit some of the properties of an application. In any case, these methods suffer from the following drawbacks:
• Experimental time for a single injection : This time consists of two parts, the
build up time and monitoring time. Build up time would be the time required
for full execution up until fault point, while monitoring time is what follows the
injection until an outcome.
• Multiplied impact of the single injection overhead : Studying the complete application is limited by the single-injection overhead time multiplied over the number of injections applied to the application. Evaluating the fault tolerance of the application requires a statistically significant number of injections, which consumes a lot of time, limiting the exposure of the application to a small number of faults.
• Inaccuracy in determination of dependability properties of a system : Due to the combined effect of the above two points, the limited number of injections provides little insight into the dependability behavior of the system under all kinds of faults.
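The first two drawbacks can be put into a rough cost model. The sketch below is illustrative and all timing values are assumptions, not measurements from the thesis: a full-execution injection pays build-up time to reach the fault point plus monitoring time afterwards, while a trace replay runs only from the injection point to the end of a short trace.

```python
# Illustrative cost model for an injection campaign. All timings are
# assumed values chosen only to show the multiplied-overhead effect.

def full_injection_time(buildup_s, monitor_s):
    """One full-execution injection: build-up before the fault point
    plus monitoring afterwards to classify the outcome."""
    return buildup_s + monitor_s

def campaign_time(n_injections, per_injection_s):
    """The single-injection overhead multiplies over the campaign."""
    return n_injections * per_injection_s

full = campaign_time(1000, full_injection_time(buildup_s=60.0, monitor_s=120.0))
replay = campaign_time(1000, 0.05)   # assumed 50 ms to replay a short trace

print(f"full-execution campaign: {full / 3600:.1f} h, trace replay: {replay:.0f} s")
```

Even with these modest assumed per-run times, the full-execution campaign takes tens of hours while the bounded replays finish in seconds, which is the gap TEFIS is designed to close.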
Keeping the above limitations in mind, there is a need for a system which avoids full executions of the application for each injection, and which can explore the dependability properties of the system that affect its fault susceptibility. The
trace emulation fault injecting system (TEFIS) has been developed with the following
goals in mind:
(1) Accuracy in fault tolerance with fractional execution : The framework employs fractional traces for execution within the binary. Traces of any length can be obtained at any time point within the application. Accuracy is determined by comparing results of this fractionally injected execution to those of full injections. Figure 4.1 shows a comparison between full injection and fractional emulation on trace lengths of 100 and 200 instructions, where the dark regions show the fractional length of execution.
(2) Very high speed injections and summary generation : The length of a single run extends from the point of injection to the end of the trace. This eliminates the build-up time before the point of injection and the monitoring time after the end of the trace, limiting the execution time to only the number of instructions per trace. This method not only speeds up result collection but also provides room for a large number of emulations.
Figure 4.1: Fractional emulation captures the effect of fractional execution with traces against full injections in the entire application. The figure shows the exclusion of build-up and monitoring time.
(3) Expose deeper understanding of program behavior : Program properties have
been studied based on their possible impact on the fault susceptibility of the
application. Two methods of correlation are described and used in this analysis: code region similarity and data dependence similarity.
4.2 The TEFIS Framework
The framework has been designed to have functionally independent units. Each of these units is presented in the order in which they are executed within the framework. The three units of this procedure are execution tracing, fault generation and fault emulation.
The tracing and emulation system uses PIN [21], a binary instrumentation tool
developed at Intel. PIN is used in this framework for generating and loading context
information at an instruction level.
4.2.1 Execution Tracing
Execution tracing is a one time procedure involving generation of traces from the
binary with a number of controllable parameters of operation. The framework provides
operational flexibility on tracing methods with the following options:
• Length - Traces of any length can be generated
• Time
∗ Any point in time
∗ Interval - uniform intervals with uniform lengths
• Phase - Integration with SIMPOINTS [5]
• Function specific tracing
This system uses a tool called ExecutionTracer for this purpose. The Execution-
Tracer is a PIN [21] tool capable of dumping snippets of the binary at run time with
predefined knobs for the user to control the length and starting point of the trace. Once
these parameters are passed in, every instruction is instrumented before its execution.
This instrumentation constitutes gathering of the following information:
• Trace context : Disassembly with instruction pointer which give control flow
information
• Register context : Value stored within each of the 32 bit registers
• Memory context : Any values read from or written to memory along with their
addresses.
• Dependence graphs : Data flow dependence information generated in graphical
format with the help of dot [12]
• Edge context : Edge information at each basic block (Not used as of now)
An example format for context dumps within a trace is shown in Figure 4.2. The figure shows a register context, which is generated for each instruction; a memory context, if one exists for that instruction; and the disassembly for that instruction, in the form of a trace context with its instruction pointer information.

The convention for storing each context is the context name followed by the trace number. For trace generation these files form static storage, but for fault emulations they are created and destroyed on the fly as required.
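The per-instruction contexts can be sketched as below. This is illustrative Python, not the actual PIN-based ExecutionTracer; the "context name followed by trace number" naming comes from the text, while the record field names (`ip`, `disasm`, `regs`, `mem`) are assumptions made for the sketch.

```python
# Minimal sketch of per-instruction context dumping: a trace context
# (instruction pointer + disassembly), a register context, and an
# optional memory context. Field names are invented for illustration.

def dump_contexts(trace_num, records, out=print):
    """Emit context lines for each instruction record.

    records: dicts with keys "ip", "disasm", "regs" (name -> value) and
    an optional "mem" entry of (address, value)."""
    for rec in records:
        # Trace context: instruction pointer plus disassembly (control flow)
        out(f"trace.{trace_num}: {rec['ip']:#010x} {rec['disasm']}")
        # Register context: value stored within each 32-bit register
        out(f"regs.{trace_num}: " + " ".join(
            f"{name}={val:#010x}" for name, val in rec["regs"].items()))
        # Memory context: only if this instruction touched memory
        if rec.get("mem") is not None:
            addr, val = rec["mem"]
            out(f"mem.{trace_num}: [{addr:#010x}]={val:#010x}")

dump_contexts(0, [{"ip": 0x08048000, "disasm": "add eax, ebx",
                   "regs": {"eax": 5, "ebx": 7}, "mem": None}])
```

Passing `out` as a parameter keeps the sketch testable; the real tool writes these records to per-trace files instead.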
Figure 4.2: Context dump.
4.2.2 Fault Generation
Fault generation follows the one-time tracing procedure. This is a highly iterative process that repeats a number of times within a trace. The bit flips required for fault emulation are generated in this step. The tool used for bit flips proceeds through a number of stages, which are listed sequentially as follows:
• Trace - A trace generated from the binary forms the top level entry
• Instruction - Each instruction within a trace forms the next level of entry
• Source registers - Every 32-bit source register of an instruction, if present, forms the next level of entry

• Bits - Every bit in the 32-bit source register is flipped, and each flip forms a single emulation
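The hierarchy above (trace, instruction, source register, bit) amounts to enumerating every single-bit fault. A minimal sketch, with a hypothetical trace encoded as (disassembly, source-register list) pairs rather than the framework's actual context files:

```python
# Enumerate every single-bit fault for a trace: for each instruction,
# flip each bit of each 32-bit source register; each flip is one
# emulation. The trace encoding here is invented for illustration.

def generate_faults(trace, width=32):
    """Yield (instr_index, register, bit) for every single-bit fault."""
    for i, (_, src_regs) in enumerate(trace):
        for reg in src_regs:
            for bit in range(width):
                yield (i, reg, bit)

trace = [("add eax, ebx", ["eax", "ebx"]),
         ("inc ecx",      ["ecx"])]

faults = list(generate_faults(trace))
print(len(faults))   # 3 source registers x 32 bits = 96 emulations
```

Even this tiny two-instruction trace yields 96 emulations, which shows why the emulation step must keep each run short.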
Figure 4.3 gives an overview of the tracing and fault generation procedure. The right half of the figure depicts the one-time tracing system, while the left-hand side shows the fault generation and emulation steps. The box in the middle shows a possible future step of filtering less susceptible instructions. The right half has three columns representing the binary, a trace and the context. The binary is shown split into traces, with each trace having a number of instructions and each instruction having a context associated with it. The fault generation portion applies to a single trace. The figure shows the number of emulations for each instruction in the trace. Within each instruction, 32 flips per source register are shown for each emulation. This structure repeats over all the traces within the executable.
Faulted contexts are saved for each emulation in a flat directory structure, with
extensions to each file name identifying the trace and emulation it belongs to. This
step prepares the faulted contexts for their actual runs during fault emulation.
4.2.3 Fault Emulation
This step runs the binary loaded with the faulted context. Each run consists
of a context with a single bit flipped within a source register of an instruction
in the trace. Though there are a large number of runs of the executable, their
execution length is limited to the end of the trace to which the emulation belongs.
Faulted register contexts from the fault generation stage, along with the memory
context dumped during tracing, are loaded at run time using instrumentation via PIN [21].
Figure 4.3: Framework overview.
Since the instruction pointer is known from the trace, the exact dumped contexts for
an instruction are loaded at exactly that point, but with one of the source register
bits flipped. After this initial loading, the binary is run from this state for the same
number of instructions that followed it in the original trace. This process is repeated
for emulations over the first half of the trace, i.e., the first 50% of instructions
within the trace. The framework stores each trace as a folder, with all the emulations
for the trace residing in that folder.
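A driver for these bounded runs might look like the following sketch. The command vector, the timeout value, and the outcome labels are assumptions for illustration, not the framework's actual interface; only the general shape (bounded execution, signal detection, timeout enforcement) is taken from the text:

```python
import subprocess

# Hypothetical driver for one fault emulation: run the instrumented
# binary from the faulted context, bounded by a timeout, and classify
# the raw outcome. On POSIX, a negative return code means the process
# was killed by that signal number (e.g. SIGSEGV).
def run_one_emulation(cmd, timeout_s=10):
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"           # stuck in a loop or hung
    if proc.returncode < 0:
        return "signal_fault"      # aborted by the operating system
    return "completed"             # contexts still need comparison
```

A "completed" run is not yet a result; its end-of-trace contexts must still be compared against the reference trace.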
Figure 4.4 shows the fault emulation system, with the reference trace on the left
and the emulation trace on the right. The reference trace is shown for the purpose of
result comparison. The figure shows the point of injection, which consists of loading
the memory context and the faulted register context gathered for the same instruction
in the reference trace. When the binary is run with the state set in this manner,
there are five possible outcomes of execution:
• Data flow deviation: This results in a context mismatch with or without a
control flow deviation
• Control flow deviation: Instruction pointers show a mismatch between the
traces, generally accompanied by a data flow deviation, as expected, since an
entirely new set of instructions is encountered.

Figure 4.4: Fault emulation.
• Signal fault: Trace execution aborts due to an invalid memory access, signaled
as an error by the operating system.
• Timeout: The trace gets stuck in a loop or hangs, in which case execution is
forced to halt by a timer. These cases are generally very rare due to the design
of the system.
• None: The emulation exactly matches the reference trace in all contexts
The results are based on comparing the contexts of the reference trace with those of
the emulated trace at the end of execution of the last instruction. The results are
placed in bins indicating mismatches, signal faults, or correct execution.
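The end-of-trace classification described above can be made concrete as follows. The dictionary layout for an end-of-trace state (registers, memory, instruction pointer) and the bin names are illustrative assumptions; the comparison rule itself comes from the text:

```python
# Sketch of end-of-trace result classification. `ref` and `emu` are
# hypothetical end-of-trace states holding the register context, the
# memory context, and the final instruction pointer. The `signal` and
# `timed_out` flags would come from the emulation driver.
def classify(ref, emu, signal=False, timed_out=False):
    if signal:
        return "signal_fault"
    if timed_out:
        return "timeout"
    if ref == emu:
        return "none"        # exact match: the fault was masked
    return "mismatch"        # data- or control-flow deviation

ref = {"regs": {"eax": 1}, "mem": {0x1000: 7}, "ip": 0x8048000}
```

Comparing complete dictionaries at once mirrors the requirement that all contexts and the control flow must match for a run to count as correct.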
4.3 Trace Emulation Experimental Results and Analysis
Twelve integer benchmarks from the SPEC2000 suite were used as candidate
applications for the fault emulation experiments. Full injections were also performed
on the same set of traces that were used for emulation. 1000 traces per benchmark
were generated for the experiment. Trace lengths of 100 and 200 instructions were
selected, and emulations were run across half the length of each trace. Since traces
of length 100 and 200 start at the same points, the 200-instruction traces subsume
the characteristics of the shorter ones.
4.3.1 Fault tolerance categories
Each emulation was compared with its reference trace on the fly, and results
were generated. The grouping of emulation results can be summarized as follows:
• Mismatch: A mismatch in the register context, the memory context, or the
control flow
• Signal fault: The emulation tool is equipped with handlers to catch any signal
faults from the operating system
• None: All contexts and the control flow match at the end of the trace
Figure 4.5 shows the fault tolerance characteristics of each benchmark, based on
the overall result categories mentioned above. The groupings in the left figure are
made using a cut-off percentage: "Mostly incorrect" signifies over 50% incorrect
results with fewer than 10% correct entries in that trace, and likewise for the
segmentation fault case, while the "Correct" bin accumulates all traces with over
10% correct entries. The figure gives an overall characterization of each benchmark's
fault tolerance behavior; it can be seen that 186.crafty and 256.perlbmk show highly
fault-tolerant behavior.
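The cut-off rule can be written down directly. The thresholds (over 50% for a dominant outcome, under/over 10% for correct entries) are taken from the text; the precedence given to the "correct" bin and the `other` fallback label are assumptions:

```python
# Sketch of the per-trace binning rule used for Figure 4.5. Percentages
# are the shares of each outcome among a trace's emulations.
def bin_trace(pct_incorrect, pct_sigfault, pct_correct):
    """A trace with over 10% correct entries lands in the 'correct' bin;
    otherwise it is 'mostly incorrect' (or 'mostly sigfault') when that
    outcome exceeds 50% with correct entries below 10%."""
    if pct_correct > 10:
        return "correct"
    if pct_incorrect > 50:
        return "mostly_incorrect"
    if pct_sigfault > 50:
        return "mostly_sigfault"
    return "other"
```

For example, the 164.gzip trace discussed later, with about 91.4% incorrect results, would fall in the "mostly incorrect" bin.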
Figure 4.5: Result categories.
4.3.2 Accuracy of emulations
As described previously, the goals of the framework are to demonstrate two
contributions: the accuracy of the framework and its speed of execution, both
evaluated against full injections. For this purpose, the same traces used for
emulation were taken and full injections were performed on each bit of every source
register found in every instruction of the trace. Since the full injections run the
complete binary and base their results on the entire execution, the emulation results
are compared against them to determine accuracy.
Figure 4.6 shows the accuracy of emulations at the two trace lengths against full
injections. The figure shows three vertical bars per benchmark: the first two are
emulations at trace lengths of 100 and 200 instructions respectively, while the last
bar shows full injections. The categories are in the same order as mentioned before,
and the Y axis shows the percentage contribution per category. The accuracy of
emulation is determined by how closely the correct entries for the 100- and
200-instruction emulations track the full injections. From the figure, the accuracy
appears fairly good for both trace lengths; though application dependent, the real
injection data lies fairly close to the emulation results observed at these trace
lengths.
An enhancement would be to use trace lengths increasing beyond 200 instructions.
An experiment was conducted to monitor the position of the control flow deviation
caused by fault injections; it was observed that a deviation would generally occur
within 100 to 200 basic blocks. Assuming an average of 5 instructions per basic
block, that gives an approximate length of 1000 instructions. A saturating curve for
correctness should therefore be observed as trace lengths approach this value. A
follow-up analysis would check whether the number of correct entries trends closer to
the full injections with this method.
Figure 4.6: Emulation accuracy.
4.3.3 Average Execution Time of Fault Emulations Against Full Injections
Figure 4.7 compares the execution times of 1000 emulations against the same number
of full injections. Notice that the Y axis is on a logarithmic scale. It can be seen
that the execution times for emulations are much smaller than for the full injections.
Moreover, the emulation execution time remains roughly constant across all benchmarks,
since the number of instructions executed remains more or less the same.
Figure 4.7: A comparison of execution time.
The two results above are the most important in establishing the accuracy of the
emulation system as well as its execution times, which are orders of magnitude lower
than those of full injections. The system also has the flexibility to add an extra
block for ineffective-instruction filtering, which would further enhance its ability
to capture the fault susceptibility of an application while reducing execution times.
Chapter 5
Analysis of Fault Tolerance Program Behavior
The previous sections established the accuracy and reduced execution time of the
emulation framework. This chapter correlates the fault tolerance observed with the
system to program behavior, which is necessary for understanding the relation between
the properties of a running program and its fault susceptibility. The criterion for
selecting such properties rests on the assumption that similar code execution should
exhibit similar fault susceptibility.
5.1 Source Code Analysis
Finding the properties of a program that affect its fault tolerance requires a much
deeper understanding of what happens within the source. An application may be
analyzed either at the algorithm level or through its dynamic assembly to understand
its fault susceptibility. This section examines algorithmic effects on fault tolerance
and performs a source code analysis in order to understand program behavior from a
reliability point of view.
5.1.1 Algorithm Level Effects
To observe an algorithm-based effect on fault tolerance, fault injections were applied
to sorting routines that perform the same function on the same inputs but use
different algorithms. These applications were also compiled in different versions to
observe the effect of optimizations on each. A portion of the results is displayed in
Figure 5.1 to show the effect of the algorithm on the fault tolerance of an
application.
(a) Heapsort
(b) Quicksort
Figure 5.1: Fault tolerance analysis of sorting routines: (a) Heapsort, (b) Quicksort.
5.1.2 Dynamic Source Code Behavior
A PIN [21] tool was written to track the dynamic flow of instructions through basic
blocks along with their disassembly. The tool takes a function name and dumps a
static control flow graph with markings for its dynamic execution. The function names
were taken from the functions occurring in the traces, which were extracted with a
script built on another simple PIN [21] tool. This gives an idea of what the trace is
actually executing and a complete view of how a fault propagates.
Sections of these traces were processed through this tool to analyze which
instructions a fault encountered as it flowed dynamically through the program. A
fault-tolerant trace has already been illustrated in the motivation section; this
section presents further cases of traces with low fault tolerance and high
segmentation fault rates.
Figure 5.2 shows a case of low fault tolerance, about 91.4% incorrect behavior, in
164.gzip. The code is a function called updcrc() from the trace, which performs a CRC
check and hence is compute intensive. Compared to the highly fault-tolerant case,
this layout has no binary decisions or logic to mask values off. Instead, the
operations just before returning from the function perform an XOR with all F's, which
necessarily flips all bits; from the context point of view, this information will
never match after a fault injection.
Figure 5.3 shows a case with a high rate of segmentation faults, a function from
300.twolf. The source code shows that this function traverses a linked list,
assigning values within it. Most of the assignments use pointers, making it a very
likely candidate for misaligned memory accesses.
/* Source function */
ulg updcrc(s, n)
    uch *s;       /* pointer to bytes to pump through */
    unsigned n;   /* number of bytes in s[] */
{
    register ulg c;                      /* temporary variable */
    static ulg crc = (ulg)0xffffffffL;   /* shift register contents */

    if (s == NULL) {
        c = 0xffffffffL;
    } else {
        c = crc;
        if (n) do {
            c = crc_32_tab[((int)c ^ (*s++)) & 0xff] ^ (c >> 8);
        } while (--n);
    }
    crc = c;
    return c ^ 0xffffffffL;  /* (instead of ~c for 64-bit machines) */
}

(a) Source code (b) Dynamic control flow

Figure 5.2: A trace from 164.gzip with 91.4% incorrect results of emulation.

The study shown above calls for properties closer to the source, chosen for their
impact on fault tolerance. Two such properties and their effects on the fault
susceptibility of a program have been selected. The following sections analyze code
region similarity and data dependency of a program as tools for correlating its fault
tolerance.
5.2 Dynamic Program Trace Behavior
This section analyzes program behavior based on the similarity of code regions and
its effect on the fault tolerance of the application. The analysis compares the
existing traces with each other, attempting to find matches in program counter
values. A match indicates that the program is executing in the same code region,
though probably in a different dynamic state.
(a) Source code (b) Dynamic control flow
Figure 5.3: A trace from 300.twolf with 100% segmentation fault results of emulation.
The correlation with fault tolerance is based on generating a similarity score for a
pair of traces, indicating how similar the two traces are. The score generation
procedure takes two traces, finds the percentage of matching program counters from
one trace to the other and vice versa, and averages the two values to produce the
similarity score. This procedure is iterated over all pairs of traces. Figure 5.4
shows the overall correlation of fault tolerance with PC similarity across all
benchmarks; the different bins represent the delta in fault tolerance for similarity
scores above 65%.
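The score described above can be sketched directly. Representing a trace as its list of program counter values is an assumption for illustration; the two-directional match percentage and the averaging come from the text:

```python
# Sketch of the program-counter similarity score between two traces:
# the percentage of one trace's PCs found in the other, computed in
# both directions and averaged.
def pc_similarity(trace_a, trace_b):
    """`trace_a` and `trace_b` are lists of program-counter values."""
    set_a, set_b = set(trace_a), set(trace_b)
    a_in_b = 100.0 * sum(pc in set_b for pc in trace_a) / len(trace_a)
    b_in_a = 100.0 * sum(pc in set_a for pc in trace_b) / len(trace_b)
    return (a_in_b + b_in_a) / 2.0
```

Averaging both directions keeps the score symmetric even when the two traces have different lengths or revisit the same code region a different number of times.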
Figure 5.4: Fault tolerance correlation with program counter similarity.
To observe the correlation, the difference in fault tolerance between pairs of traces
with similarity scores of 65% and above is plotted as shown in Figure 5.5. The figure
plots the program counter similarity score against the delta in fault tolerance
between the pair of traces. The curve is a third-degree polynomial fit depicting the
trend of a program's fault susceptibility with the similarity of its code regions.
Two trends are observable: the figures for 175.vpr and 181.mcf show a very promising
decrease in the fault tolerance delta with increasing similarity score. On the other
hand, compression applications like 164.gzip and 256.bzip2 show very irregular
behavior. This could be due to large variations in the dynamic state: similar
instructions are being executed, but at different times. These applications also tend
to have more inter-procedural calls than the others.
5.2.1 Inter-Procedural Fault Tolerance
(a) 175.vpr (b) 256.bzip2
(c) 181.mcf (d) 164.gzip
Figure 5.5: Fault susceptibility correlation calculated using similar program code points.
The existing trace information was used to correlate the number of procedures in a
trace with its fault susceptibility. As expected, a decreasing trend of fault
tolerance was observed with an increasing number of inter-procedural calls within a
trace. Figure 5.6 shows the average fault tolerance across all traces as the number
of procedures per trace increases. The figure also shows how often each procedure
count occurs among a set of 1000 traces, scaled by a factor of 0.2 for visibility in
the same region. Traces containing a single function dominate, and the fewer traces
with higher numbers of procedures show the highest fault susceptibility.
Figure 5.6: Inter-procedural fault tolerance.

The study above gives some indication that although certain applications show a
favorable trend of fault tolerance with code similarity, others depend on the dynamic
state of the system. This is taken into consideration in the following section, where
the data dependencies, the primary propagators of a fault, are analyzed and compared
among traces.
5.3 Dynamic Dependence Graph Representation
Program counter correlation captures similarity in terms of functional execution, but
fails to capture the dynamic data flow, which propagates or masks the effect of a
transient fault. The propagation of a fault depends on the data dependencies of a
program, and this information needs to be captured and used to find similar trends in
order to correlate with the fault susceptibility of an application. Figure 5.7 shows
an example of a data dependency graph in 'dot' format. The reliability of the system
is inherent in the connectivity of the graph.
For this purpose, data dependency graphs are generated from the traces by a graph
clustering tool developed by Dennis Shasha at NYU [8]. The tool takes a data set of
directed graphs produced by a PIN [21] tool that instruments and dumps data
dependency information. The graph clustering tool uses SUBDUE [14] to find common
sub-structures in a given trace, then iterates over all traces to find similar
structures and clusters them together. The tightness measure of a cluster, which
defines the precision of similarity, can also be specified.

Figure 5.7: Example: Dependency graph.
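The exact similarity metric computed by the clustering tool is not reproduced here. As an illustrative stand-in, a dependence graph can be represented as a set of directed edges and two graphs compared with a Jaccard-style score; this is an assumption for exposition, not the SUBDUE-based procedure itself:

```python
# Illustrative stand-in for dependence-graph similarity: represent each
# graph as a set of (producer, consumer) edges and score the overlap of
# the two edge sets as a percentage (Jaccard index).
def edge_jaccard(graph_a, graph_b):
    """`graph_a` and `graph_b` are sets of (producer, consumer) edges."""
    if not graph_a and not graph_b:
        return 100.0
    inter = len(graph_a & graph_b)
    union = len(graph_a | graph_b)
    return 100.0 * inter / union
```

Any such pairwise score can then be thresholded (e.g. at the 35% level used below) to decide which trace pairs to plot against their fault tolerance delta.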
The graph clustering procedure above was iterated over all the generated traces. A
similarity score was calculated for each pair of traces based on the closeness of the
dependence graphs contained in each trace. This analysis directly correlates fault
tolerance with the dynamic dependencies of the program, the active ingredient in
fault propagation. For a broader view, similarity scores of 35% and above have been
plotted against their respective fault tolerances. The graph on the left gives the
actual raw numbers, while that on the right is a polynomial fit of the graph on the
left.
The cases in Figure 5.8 are the same as those shown for program counter similarity,
yet they show a decreasing trend in fault tolerance with increasing dependence
similarity score. A similar trend was observed among all the other benchmarks, with
very few outliers.
(a) 164.gzip (b) 164.gzip - Trend
(c) 175.vpr (d) 175.vpr - Trend

Figure 5.8: Dependence graph similarity correlation.

(a) All Benchmarks (b) All Benchmarks - Trend

Figure 5.9: Dependence graph similarity correlation - All benchmarks.

Figure 5.9 shows the dependence similarity correlation over all the benchmarks. From
the figure, it can be observed that the trend remains similar when averaged over all
the benchmarks: the delta in fault tolerance decreases with increasing similarity,
although a slight rise in the curve can be observed at the end.
5.4 Estimating Program Fault Tolerance
Based on the two correlation techniques presented in the previous sections, it is
evident that dependency graph similarity correlates much more accurately with fault
tolerance. This technique is put to use in predicting fault tolerance based on the
consistent trends of fault susceptibility seen against graph similarity scores. If
the prediction accuracy falls within an acceptable range, one could look only at the
dependence structure of a program and predict its fault tolerance without any
injections.
The existing emulation results for the traces, along with their dependency graphs,
were used to build prediction models. The model used here predicts a trace's fault
tolerance from an increasing number of the graphs most similar to the reference
graph; for more than one graph, the fault tolerance values are averaged. Ten models
were selected: the first predicts the fault tolerance of a trace from the fault
tolerance of the single trace with the most similar dependency graph, while models
with two or more graphs select the designated number of most similar graphs and
average their fault tolerance scores. The accuracy of the prediction model is
observed as the number of similarity graphs in use increases.
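The averaging model above amounts to a nearest-neighbor average over similarity scores. The dictionary-based representation below is an illustrative assumption; the selection of the k most similar graphs and the averaging are taken from the text:

```python
# Sketch of the k-most-similar prediction model: predict a trace's fault
# tolerance as the average fault tolerance of the k traces whose
# dependence graphs score highest in similarity against it.
def predict_fault_tolerance(similarities, fault_tolerance, k):
    """`similarities` maps trace id -> similarity score against the
    reference trace; `fault_tolerance` maps trace id -> measured value."""
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    return sum(fault_tolerance[t] for t in nearest) / k
```

The ten models of this section correspond to k = 1 through 10, and the prediction error is the absolute difference between this value and the trace's measured fault tolerance, averaged over all traces.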
Figure 5.10 gives the accuracy of each prediction model averaged over all the traces
of the 12 benchmarks. The X axis gives the model, identified by the number of
similarity graphs used for prediction, and the Y axis gives the average accuracy of
prediction. Since the Y axis is the difference between the actual and predicted fault
tolerance averaged over all traces, a value closer to the X axis represents higher
prediction accuracy.
Figure 5.10: Fault tolerance prediction.

Based on the figure, it is evident that the accuracy of fault tolerance prediction
increases with the number of similarity graphs used by the model. This prediction
model uses an averaging approach; better models need to be studied to improve
accuracy. Still, this study gives a baseline from which one can estimate the fault
susceptibility of any application based on the behavior of its dependency graphs.
Chapter 6
Future Work
The TEFIS framework does justify a fractional-execution system with accuracy close to
the full injection methodology, but further investigation of the trace length of
execution is required. The longer the trace, the closer the emulation accuracy comes
to full injections, but at the cost of execution speed. One would generally expect an
increasing curve of accuracy, as in Figure 4.6, with increasing trace length for any
application. Future work would include examining these curves for various
applications.
This thesis presented an approach to understanding the fault tolerance of a program
by examining its source code and determining possible points of vulnerability.
Observations based on logical masking effects showed that a large percentage of
injections could prove ineffective. Adding selective injection based on this study
would be fairly easy within the existing framework.
Dependency graphs appear to be a good representation for correlating with the fault
tolerance of a program. The precision of cluster formation based on levels of graph
similarity definitely requires further investigation. Also, the fault tolerance
predictions made in this thesis use an averaging method; better solutions may include
using a larger set of graphs with a prediction method based on standard deviation.
Applying this framework and correlation methodology to a number of compiled versions
of an executable would be an interesting study of the effects a compiler can have on
an application, as shown in Figure 5.1.
Chapter 7
Conclusion
Semiconductor transient faults (soft errors) are a critical design concern in the
reliability of computer systems. Most recent architecture research is focused on using
performance models to provide Architecture Vulnerability Factor (AVF) estimates of
processor reliability rather than deploying detailed fault injection into hardware RTL
models. While AVF analysis provides support for investigating new fault tolerant ar-
chitecture techniques, program execution characteristics are largely missing from the
determination of periods of soft error susceptibility. The primary problem with AVF is
that software periods of vulnerability substantially differ from micro-architecture peri-
ods of vulnerability. As research trends dictate finding ways to selectively enable tran-
sient fault tolerant mechanisms, run-time and off-line experimental techniques must be
guided equally by program behavior and hardware.
Bibliography
[1] Multiple Instruction Issue in the NonStop Cyclone Processor, 1990.

[2] S. Baloch, T. Arslan, and A. Stoica. Design of a single event upset (SEU) mitigation technique for programmable devices. In ISQED '06: Proceedings of the 7th International Symposium on Quality Electronic Design, pages 330–345, Washington, DC, USA, 2006. IEEE Computer Society.

[3] Robert C. Baumann. Soft errors in commercial semiconductor technology: Overview and scaling trends. In IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pages 121 01.1–121 01.14, April 2002.

[4] Shekhar Borkar. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, 2005.

[5] Brad Calder, Greg Hamerly, and Tim Sherwood. SimPoint.

[6] J. Adam Butts and Guri Sohi. Dynamic dead-instruction detection and elimination. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[7] Jeffrey J. Cook and Craig Zilles. A characterization of instruction-level derating and its implications for error detection. 2008.

[8] Dennis Shasha and Diego Reforgiato Recupero. GraphClust.

[9] Mohamed Gomaa et al. Transient-fault recovery for chip multiprocessors. In ISCA, 2003.

[10] Mohamed A. Gomaa and T. N. Vijaykumar. Opportunistic transient-fault detection. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 172–183, Washington, DC, USA, 2005. IEEE Computer Society.

[11] S. Hareland et al. Impact of CMOS scaling and SOI on soft error rates of logic processes. In VLSI Technology Digest of Technical Papers, 2001.

[12] John Ellson, Emden Gansner, Yehuda Koren, Eleftherios Koutsofios, John Mocenigo, Stephen North, and Gordon Woodhull.

[13] T. Karnik et al. Scaling trends of cosmic ray induced soft errors in static latches beyond 0.18µ. In VLSI Circuit Digest of Technical Papers, 2001.

[14] Nikhil S. Ketkar, Lawrence B. Holder, and Diane J. Cook. Subdue: compression-based frequent pattern discovery in graph data. In OSDM '05: Proceedings of the 1st International Workshop on Open Source Data Mining, pages 71–76, New York, NY, USA, 2005. ACM.

[15] Kevin M. Lepak and Mikko H. Lipasti. Silent stores for free. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 22–31, New York, NY, USA, 2000. ACM.

[16] M. Lipasti. Value locality and speculative execution, 1997.

[17] Sarah E. Michalak et al. Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer. IEEE Transactions on Device and Materials Reliability, 5(3):329–335, September 2005.

[18] Shubhendu S. Mukherjee et al. Detailed design and evaluation of redundant multithreading alternatives. In ISCA, 2002.

[19] Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 29. IEEE Computer Society, 2003.

[20] Karthik Pattabiraman, Nithin Nakka, Zbigniew Kalbarczyk, and Ravishankar Iyer. SymPLFIED: Symbolic program-level fault-injection and error-detection framework. In Proceedings of the International Conference on Dependable Systems and Networks (DSN). IEEE, June 2008. This paper won the William Carter award for best paper.

[21] PIN Dynamic Instrumentation Tool. http://rogue.colorado.edu/pin/.

[22] Maurizio Rebaudengo, Matteo Sonza Reorda, Marco Torchiano, and Massimo Violante. Soft-error detection through software fault-tolerance techniques. In DFT '99: Proceedings of the 14th International Symposium on Defect and Fault-Tolerance in VLSI Systems, pages 210–218, Washington, DC, USA, 1999. IEEE Computer Society.

[23] Steven K. Reinhardt and Shubhendu S. Mukherjee. Transient fault detection via simultaneous multithreading. SIGARCH Comput. Archit. News, 28(2):25–36, 2000.

[24] George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee. Configurable transient fault detection via dynamic binary translation. In Proceedings of the 2nd Workshop on Architectural Reliability (WAR), December 2006.

[25] George A. Reis et al. SWIFT: Software implemented fault tolerance. In CGO, 2005.

[26] Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 389–399, June 2002.

[27] Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, and Daniel A. Connors. Using process-level redundancy to exploit multiple cores for transient fault tolerance. In DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 297–306, Washington, DC, USA, 2007. IEEE Computer Society.

[28] Timothy J. Slegel et al. IBM's S/390 G5 microprocessor design. IEEE Micro, 19(2):12–23, March 1999.

[29] Nicholas Wang, Michael Fertig, and Sanjay Patel. Y-branches: When you come to a fork in the road, take it. In PACT '03: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, page 56, Washington, DC, USA, 2003. IEEE Computer Society.

[30] Nicholas J. Wang, Justin Quek, Todd M. Rafacz, and Sanjay J. Patel. Characterizing the effects of transient faults on a high-performance processor pipeline. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks, page 61, Washington, DC, USA, 2004. IEEE Computer Society.

[31] Y. C. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the 1996 IEEE Aerospace Applications Conference, volume 1, pages 293–307, February 1996.
[31] Y.C. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedingsof the 1996 IEEE Aerospace Applications Conference, volume 1, pages 293–307,February 1996.