
Architectural Vulnerability Factor Estimation through Fault

Injections

A Thesis Presented

by

Fritz Gerald Previlon

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements

for the degree of

Master of Science

in

Computer Engineering

Northeastern University

Boston, Massachusetts

April 2016


To my family!!


Contents

List of Figures

List of Tables

List of Acronyms

Acknowledgments

Abstract of the Thesis

1 Introduction
  1.1 Introduction to GPU Programming
    1.1.1 The Open Compute Language (OpenCL)
    1.1.2 The OpenCL Platform Model
    1.1.3 Architecture of the Evergreen family of GPUs
    1.1.4 Register File
    1.1.5 Local Data Storage (LDS)
  1.2 Contributions of this Thesis
  1.3 Organization of the Thesis

2 Background
  2.1 Soft Error Overview
    2.1.1 Faults vs Errors
    2.1.2 Correctable Errors (CE)
    2.1.3 Detected Unrecoverable Errors (DUE)
    2.1.4 Silent Data Corruptions (SDC)
  2.2 Transient Fault - Background and Terminology
    2.2.1 Raw Circuit FIT Rate
    2.2.2 Vulnerability Factor
  2.3 Architectural Vulnerability Factor
    2.3.1 Fault Injection
    2.3.2 Architecturally Correct Execution (ACE) Analysis

3 Transient Faults on GPUs
  3.1 Fault Injection Studies
    3.1.1 GPU-Qin
    3.1.2 SASSIFI
    3.1.3 Multi2Sim Fault Injection [1]
  3.2 ACE Analysis Studies
    3.2.1 GPGPU-SODA

4 Methodology
  4.1 Multi2Sim Simulation Model
  4.2 Multi2Sim for Fault Injection
  4.3 Statistical Significance
  4.4 Post-Experiment Analysis
  4.5 Evaluation Framework
    4.5.1 Platform for Evaluation
    4.5.2 Evaluated Benchmarks

5 Results and Analysis
  5.1 Local Data Storage
  5.2 Register File
  5.3 How does vulnerability vary over time?
    5.3.1 Case Study: LDS - RadixSort
    5.3.2 Case Study: MatrixMultiplication

6 Conclusion

Bibliography


List of Figures

1.1 Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years (Source [2])

1.2 OpenCL Programming Model and Evergreen Hardware Architecture

4.1 Possible outcomes for each simulation run

4.2 How the Architectural Vulnerability Factor (AVF) value changes with the number of fault-injection experiments; the AVF value shows little variation and stabilizes after 5,000 injections

4.3 Formula for the number of faults to select for injection

5.1 Results of fault injection experiments on the Local Data Share

5.2 Amount of local memory used by each application for each NDRange mapped to the compute device

5.3 Results of fault injection experiments on the General-Purpose Register File

5.4 Number of general-purpose registers used by each application for each NDRange mapped to the compute device

5.5 Intervals of vulnerability for RadixSort; the faults that lead to incorrect output fall only into specific intervals of time

5.6 Intervals for LDS accesses for RadixSort

5.7 Intervals of vulnerability for MatrixMultiplication; the vulnerability of MatrixMultiplication shows a periodic behavior

5.8 Local memory accesses in MatrixMultiplication show a periodic behavior


List of Tables

4.1 The GPU configuration used in the experiments

4.2 The benchmarks used in the experiments


List of Acronyms

AVF Architectural Vulnerability Factor. The probability that a soft error will cause an error in a program's output. The AVF of an architectural bit can be thought of as the fraction of time the bit matters for the final output of a program.

ACE Architecturally Correct Execution. ACE analysis is a method to derive an upper bound on AVF using performance simulation.

LDS Local Data Storage. The local memory module in a compute unit of a GPU. This module is shared between work-items in a compute unit and allows for communication between the work-items. It can be manipulated using explicit instructions.


Acknowledgments

Here I wish to thank those who have supported me during the process of this thesis work. First, I would like to thank my family and close friends, who have encouraged me and believed in me. Their support was critical to the completion of this work. I also want to thank the friends from NUCAR and my committee members (Dr. Vilas Sridharan and Prof. Ningfang Mi), who have provided important and constructive feedback throughout the progress of this thesis.

Lastly, my advisor David Kaeli has been a reliable and indispensable guide and support from the beginning of this work. Many thanks!


Abstract of the Thesis

Architectural Vulnerability Factor Estimation through Fault Injections

by

Fritz Gerald Previlon

Master of Science in Electrical and Computer Engineering

Northeastern University, April 2016

David Kaeli, Adviser

Given their large number of processing cores and their impressive parallel processing capabilities, Graphics Processing Units (GPUs) have become the accelerator of choice across multiple domains. GPUs are able to accelerate processing in a wide range of applications, including scientific computing, bio-informatics, and financial applications. Their presence in the world's fastest supercomputers has been steadily growing over the last few years.

With technology scaling, soft-error reliability has become a major issue for hardware designers. A soft error is a non-permanent fault in which a bit flip occurs in a latch or memory cell. A recent study by the Department of Energy identified soft errors as one of the top 10 barriers to exascale computing. The architecture research community needs to pursue solutions to address the challenges presented by the growing presence of soft errors. While some soft errors will not necessarily cause an error at the output of a program, many will corrupt vulnerable program state. Since GPUs are increasingly being used for compute instead of just graphics, their reliability has become a concern. Therefore, an important step in tackling soft errors in GPUs is to first assess the impact of soft errors and the robustness of GPUs in the presence of these faults.

In this thesis, we evaluate this question using fault injection on the AMD Evergreen family of GPUs. In this study, we inject bit flips using a detailed architectural simulator. Our results indicate that a GPU can be highly resilient to soft errors. We present a study of trends that appear in common GPU programs when soft errors occur in the GPU memory hierarchy. These trends can be used to inform programmers, as well as system designers, when making decisions about how to increase the reliability of GPU software and hardware.


Chapter 1

Introduction

For more than three decades, frequency scaling (increasing a processor's frequency to improve performance) has been the driving force behind Moore's law [3]. Processor frequencies have increased from 1-8 MHz in the 1970s to 2-4 GHz today, approximately 4,000 times faster. However, power and thermal constraints have made it very challenging to continue increasing the clock frequencies of microprocessors. Figure 1.1 shows how both power and clock rate increased rapidly for decades, but have recently flattened [2]. The microprocessor industry has thus turned to parallelism in order to obtain higher performance at the same frequency, with only a minimal increase in power consumption. For the past decade, general-purpose compute applications have started to leverage the parallelism provided by parallel computing hardware, as well as sophisticated parallel programming interfaces.

As parallelism became more prevalent, the market saw an increase in multi-core processors able to take advantage of the parallelism inherent in general-purpose applications. New programming interfaces were developed to facilitate the development of parallel applications, and developers looked for ways to accelerate their performance-critical applications in order to exploit the performance benefits offered by the parallelism in multi-core processors.

With their hundreds of cores and stream processing capabilities, Graphics Processing Units (GPUs) have become an attractive parallel processing device. Originally, GPU acceleration was limited first by the lack of programmable shaders, and later by the need to use graphics-oriented programming languages. Improved programmability has helped these devices become quite attractive for high-performance computing and other data-intensive applications. Beyond their primary graphics role, they are now used in a growing range of applications, including scientific computing [4], bio-informatics [5], molecular modeling [6] and financial applications [7].


Figure 1.1: Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years (Source [2])

However, because their primary use has been graphics processing, reliability (or more specifically, soft-error reliability) has never been a major concern for GPU designers. As expressed by Sheaffer et al., a user is quite unlikely to care about, or even perceive, a single-bit error in a single pixel of a single frame when running traditional gaming programs [8].

To continue to exploit the impressive parallel compute capabilities of GPUs, and to expand the use of GPUs to a wider range of markets and industries, it is imperative that reliability issues on GPUs be rigorously addressed.

In traditional CPU design, soft-error reliability is not a foreign concept; reliability has commonly been a key design trade-off considered during processor design. Soft errors are radiation-induced errors caused by energetic particles (neutrons from cosmic rays, and alpha particles from packaging materials) generating electron-hole pairs as they pass through semiconductor devices. Soft-error reliability has been well studied on CPUs; numerous techniques have been developed to characterize errors and to protect microprocessors against these faults [9][10][11]. However, little work has been done on the resiliency of GPUs in the presence of soft errors. We first need to understand how the vulnerability of these devices depends on the underlying program characteristics. In this thesis, we present an extensive fault injection study of soft-error reliability in the memory hierarchy of a class of GPUs.

We have found that, in general, a GPU is fairly resilient to soft errors when running typical GPU applications. We have also found trends in the resiliency of a GPU that can be exploited by GPU application designers to make their software more robust against soft errors.

Next, we provide background information on General-Purpose Computing on Graphics Processing Units (GPGPU).

1.1 Introduction to GPU Programming

GPUs were originally designed to efficiently render 3-D graphics, providing highly optimized datapaths for generating frame after frame of pixel data. The research community recognized that GPUs could also be used for massive data processing, and started executing floating-point computations using graphics APIs such as OpenGL and DirectX. The applications first ported to GPUs typically involved computations on matrices and vectors; matrix multiplication was one of the early CPU programs that performed significantly better when run on a graphics card [12]. However, porting these general-purpose applications to GPUs was a very complex and daunting task, as it required programmers to recast their algorithms in terms of the graphics APIs.

Industry leaders AMD and NVIDIA recognized this trend, and proposed general-purpose programming languages that would allow GPUs to be used for a broader class of applications. OpenCL [13] and CUDA [14] have emerged as two standard programming frameworks that allow GPUs to be integrated into supercomputers and desktops as accelerators. Programmers were no longer tied to the underlying graphics programming model; they could focus on high-performance computing, which attracted many more developers of general-purpose applications to the GPU platform.

1.1.1 The Open Compute Language (OpenCL)

As GPU hardware vendors introduced programmable shaders, AMD and NVIDIA introduced support for OpenCL and CUDA, respectively. These C-like parallel programming frameworks provide a Software Development Kit (SDK) that includes a rich set of APIs and compilers/runtimes/drivers. In this thesis we use programs written in OpenCL [13].

OpenCL is an emerging framework for programming heterogeneous devices. It is an industry standard maintained by Khronos, a non-profit technology consortium. OpenCL has seen increasing adoption by major industry vendors, including Apple, AMD, ARM, NVIDIA, Intel, Imagination Technologies, Qualcomm and S3. OpenCL provides a number of abstraction models, allowing the model to be applied to a wide range of system architectures. In OpenCL terminology, the GPU is referred to as the device and the CPU as the host. Figure 1.2 presents an overview of the OpenCL programming model.

Figure 1.2: OpenCL Programming Model and Evergreen Hardware Architecture. a) Elements defined in the OpenCL programming model: work-items running the same code form work-groups, which in turn compose the whole ND-Range. b) Simplified block diagram of the Radeon HD 5870 hardware architecture; this GPU belongs to the Evergreen family of AMD devices.

1.1.2 The OpenCL Platform Model

The platform model for OpenCL consists of a host connected to one or more OpenCL devices. Each device consists of one or more Compute Units (CUs), and each compute unit further consists of one or more Processing Elements (PEs). Within a device, the computations are performed by the processing elements. An OpenCL application runs on the host and sends commands to the device to be executed by the processing elements. OpenCL's programming model emphasizes parallel processing by assuming a Single Program Multiple Data (SPMD) paradigm, in which a piece of code (called a kernel) maps to multiple subsets of input data, creating a massive number of parallel threads.

The host program is the starting point for the OpenCL program and executes on the CPU. The device kernel is written in OpenCL. In most cases, the OpenCL kernel runs on a GPU device and is usually compiled during the execution of the OpenCL host program.

An instance of the kernel executing on a Processing Element is called a work-item and is identified by a global ID. Each work-item executes the same code, but the specific execution path can vary per work-item by querying its ID.

Work-items are organized into work-groups. Each work-group is assigned a work-group ID, and work-items within a work-group are also assigned a unique local ID. A set of work-groups in turn forms an ND-Range, which is a grid of work-groups that share a common global memory space.

1.1.3 Architecture of the Evergreen family of GPUs

In this thesis, we have worked with the AMD Evergreen family of GPUs to evaluate soft-error resiliency in GPUs. The Evergreen family was an earlier flagship GPU family developed by AMD. These devices were designed to target general-purpose data-intensive applications, along with the primary graphics applications. While the Evergreen devices are a few years old, they support OpenCL execution.

Figure 1.2b shows the general system architecture of the Radeon HD 5870 GPU, a popular GPU in the Evergreen family. The GPUs in this family have computational units, called compute units, that can take advantage of data parallelism.

The Radeon 5870 has 20 compute units, and each compute unit has 16 stream cores. Each stream core in a compute unit is devoted to the execution of one instance of an OpenCL kernel. The stream cores also have access to a 32 KB local data storage. Additionally, each stream core has 5 processing elements that execute the machine instructions. Interestingly, the stream cores in Evergreen are time-multiplexed in 4 slots, which gives the illusion that each stream core is running 4 different kernels simultaneously. Furthermore, the Evergreen architecture supports 5-way Very Long Instruction Word (VLIW) bundles of arithmetic instructions. The hardware support is provided in each stream core in the form of the 5 processing elements, labeled x, y, z, w and t. As a result, the Radeon 5870 GPU can issue up to 5 floating-point operations per stream core in one cycle.
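A quick back-of-the-envelope calculation shows what these numbers imply for peak throughput. The lane counts come from the figures above; the 850 MHz engine clock is our assumption (it is not stated in the text), and we count one single-precision operation per lane per cycle.

```python
# Peak-throughput sketch for the Radeon HD 5870, from the counts quoted above:
# 20 compute units x 16 stream cores x 5 VLIW lanes (x, y, z, w, t).

COMPUTE_UNITS = 20
STREAM_CORES_PER_CU = 16
PES_PER_STREAM_CORE = 5     # the 5 VLIW lanes
CLOCK_HZ = 850e6            # assumed engine clock, not from the thesis

alus = COMPUTE_UNITS * STREAM_CORES_PER_CU * PES_PER_STREAM_CORE
peak_flops = alus * CLOCK_HZ  # one single-precision op per lane per cycle

assert alus == 1600
print(f"{peak_flops / 1e12:.2f} TFLOP/s peak")
```

Counting fused multiply-add as two operations would double this figure, but the one-op-per-lane model matches the "up to 5 floating-point operations in one cycle" statement above.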


When an OpenCL kernel is launched on an Evergreen GPU, the ND-Range is initially transferred to the device. A dispatcher processes the ND-Range and assigns work-groups to any of the available compute units, in any order. Each compute unit contains a set of 16 stream cores, each devoted to the execution of one work-item. All stream cores within the compute unit have access to a common Local Data Storage (LDS), used by the work-items to share data at the work-group level. The LDS is the implementation of the local memory concept as defined in OpenCL. Finally, each stream core contains 5 processing elements to execute the Evergreen machine instructions of a work-item, plus a file of general-purpose registers, which provides the support for the private memory concept as defined in OpenCL.

The GPU memory hierarchy is divided into three memory scopes: 1) private memory (the register file), 2) local memory (the local data storage), and 3) global memory. Access to each memory scope is defined by software.

In this thesis, we focus on the vulnerability of the first two memory scopes in the GPU: the Local Data Storage and the Vector Register File. These structures represent a large portion of a GPU chip and can be directly addressed by the programmer. It is crucial for a programmer to understand how to use these memory scopes when resilience is critical to an application. We provide a brief description of each of these two structures in the following paragraphs.

1.1.4 Register File

The register file of a compute unit can be considered its private memory, as defined by the OpenCL programming model. The register file provides each work-item in a work-group that is mapped to a compute unit at a given time with its own private copy of register values. The register file can be accessed and modified by specific instructions, namely the Arithmetic Logic Unit (ALU) instructions and the TEX (fetch through a texture cache) instructions, during their read or write stages of execution.

The register file in a GPU is significantly larger than that of a CPU. Moreover, since GPUs are throughput-oriented devices, they can usually have hundreds of threads (work-items) running concurrently. GPUs utilize fine-grained scheduling among the individual threads to hide latencies associated with memory operations and dependencies between these threads. Having such a large number of threads, a throughput-oriented GPU needs dedicated hardware to support each running thread in the device. This explains the motivation for a large register file (about 64 KB in the Radeon 5870).
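The sizing pressure described above can be made concrete with a little arithmetic of our own (not from the thesis): because every resident work-item needs its own private registers, the register file capacity caps how many work-items can be resident at once. The 64 KB figure is from the text; the 16-byte vector register size is our assumption.

```python
# Rough occupancy model: how many work-items fit in a compute unit's register
# file? Uses the 64 KB figure quoted above and an assumed 16-byte vector
# register (4 components x 32 bits).

REG_FILE_BYTES = 64 * 1024      # per compute unit, as quoted in the text
BYTES_PER_VREG = 16             # assumption: one 4 x 32-bit vector register

def max_resident_work_items(regs_per_work_item: int) -> int:
    """Work-items whose private registers fit in the register file."""
    bytes_per_item = regs_per_work_item * BYTES_PER_VREG
    return REG_FILE_BYTES // bytes_per_item

# A kernel using 16 vector registers per work-item supports 256 resident
# work-items; halving register usage doubles that budget.
assert max_resident_work_items(16) == 256
assert max_resident_work_items(8) == 512
```

This is the usual trade-off a GPU compiler manages: fewer registers per thread means more threads resident to hide memory latency.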


1.1.5 Local Data Storage (LDS)

As previously explained, the register file provides private memory for the individual threads running on a GPU. Each thread is provided with its own set of separate registers, so communication between individual threads through the register file is not possible. However, the OpenCL programming model supports the sharing of data between work-items within a work-group.

The GPU uses local memory to support this feature. Each compute unit contains one local memory module that is accessible by all work-items of the work-groups running on it. The local memory's functionality is different from that of a CPU cache: in a GPU, data in local memory is manipulated using explicit instructions, and the size of the local memory is comparable to the register file size (32 KB in the Radeon 5870).
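Because LDS is explicitly allocated per work-group, its size also bounds how many work-groups can be resident on a compute unit at once. A sketch of that bound, with our own illustrative numbers and the 32 KB capacity quoted above:

```python
# LDS occupancy sketch (our own simplification): each resident work-group
# needs its own slice of the compute unit's 32 KB Local Data Storage.

LDS_BYTES_PER_CU = 32 * 1024    # Radeon 5870 figure from the text

def work_groups_per_cu(lds_bytes_per_group: int) -> int:
    """Upper bound on concurrently resident work-groups, from LDS alone."""
    if lds_bytes_per_group <= 0:
        raise ValueError("each work-group must request a positive LDS slice")
    return LDS_BYTES_PER_CU // lds_bytes_per_group

# A tiled kernel staging a 64x64 tile of 4-byte floats uses 16 KB of LDS,
# so at most two such work-groups fit per compute unit.
assert work_groups_per_cu(64 * 64 * 4) == 2
```

Real occupancy is also limited by register usage and scheduler limits; this isolates the LDS term only.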

1.2 Contributions of this Thesis

The contributions of this thesis include:

1. We present a reliability study of the vector register file and the local data share of a GPU.

2. We simulate the presence of single-bit faults using fault injection, and carry out a study by simulating GPU workloads on a cycle-based simulation model of the AMD Radeon 5870.

3. We provide a characterization of the resiliency of a suite of OpenCL kernels to the effects of particle-induced faults.

4. We observe how the vulnerability of applications changes over time and provide insights that can be used by application developers to reduce vulnerability.

1.3 Organization of the Thesis

This thesis is organized as follows. In Chapter 2, we discuss prior work on reliability modeling. In Chapter 3, we review the limited prior work on GPU reliability. Chapter 4 describes the framework we use for our fault injection experiments, as well as the details of our fault model; we also discuss the Architectural Vulnerability Factor of the applications that we use. Chapter 5 provides the results of our simulation study. Chapter 6 summarizes lessons learned in this thesis and discusses directions for future work.


Chapter 2

Background

In this chapter, we provide background information on soft errors and the methods used to deal with them. We discuss techniques and paradigms used at the architectural level to assess the error rate of a processor, and then we discuss recent reliability work for general-purpose GPUs.

2.1 Soft Error Overview

Soft errors are intermittent malfunctions of the hardware that cannot be reproduced. Soft errors are dynamic: they are changes to a cell's contents, rather than changes in the circuitry. They are caused by Single Event Upsets (SEUs), which are most often the result of particle strikes on silicon devices. Among the most common particles that produce soft errors are neutrons from cosmic rays and alpha particles from packaging materials.

When these strikes occur, the particles inject charge into the devices, which can alter the values stored in them. Each cell in a device has a minimum charge needed to change the stored value in the cell; this minimum charge is called the critical charge (Qcrit) for that cell. Following a particle strike, if the accumulated charge exceeds the critical charge of the cell, a soft error occurs. In short, particle strikes that generate a charge higher than Qcrit will cause a soft error.
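The Qcrit condition above amounts to a simple threshold test. A minimal model (our own simplification, with illustrative rather than measured charge values):

```python
# Threshold model of the Qcrit condition described above: a strike flips the
# stored bit only when the deposited charge exceeds the cell's critical
# charge. Charges are in femtocoulombs and purely illustrative.

def strike(stored_bit: int, deposited_charge_fc: float, qcrit_fc: float) -> int:
    """Return the cell's value after a particle strike."""
    if deposited_charge_fc > qcrit_fc:
        return stored_bit ^ 1  # above Qcrit: the bit flips (a soft error)
    return stored_bit          # below Qcrit: the stored value survives

assert strike(1, deposited_charge_fc=2.0, qcrit_fc=5.0) == 1  # masked at the device
assert strike(1, deposited_charge_fc=8.0, qcrit_fc=5.0) == 0  # single event upset
```

In real devices Qcrit shrinks with technology scaling, which is why soft-error rates have grown into the design concern described in this chapter.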

2.1.1 Faults vs Errors

A fault is an undesired state change in hardware. A fault in a particular layer of the computing stack may propagate to the next layer; the undesired state change in the next layer is termed an error. In this thesis, we use the term transient faults for the soft errors defined in Section 2.1. When a transient fault occurs in a bit, the bit can be overwritten, removing the fault. When the bit is not overwritten, the incorrect state that results from the fault is termed an error.

Errors can be classified based on their impact on the system as Correctable Errors, Detected Unrecoverable Errors, or Silent Data Corruptions.

2.1.2 Correctable Errors (CE)

Correctable Errors are errors from which the system is able to recover and return to normal operation, usually through either hardware or software mechanisms. Because the system is able to recover from the effects of these errors, they are usually not a cause for concern. Many vendors, however, use the reported rate of Correctable Errors as a warning that a system may have an impending hardware problem [15].

2.1.3 Detected Unrecoverable Errors (DUE)

Detected Unrecoverable Errors are errors that are discovered by a program, the operating system, or the hardware. These errors are typically reported to the system, and very often the system cannot recover; they often cause a system to crash.

2.1.4 Silent Data Corruptions (SDC)

Silent Data Corruptions are errors that alter data in a system without being detected, ultimately corrupting program state or user data. Because they can cause a program to produce incorrect results without the user's knowledge, these are the most undesirable errors.
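The three categories above are exactly the bins a fault-injection campaign sorts its runs into. A sketch of that bookkeeping (the function and its signature are ours, mirroring common practice rather than the thesis's actual tooling): each injected run is compared against a fault-free "golden" run.

```python
# Binning one fault-injection run into the outcome categories defined above,
# by comparing against a golden (fault-free) reference run. The names and
# interface are hypothetical, for illustration only.

def classify_run(crashed: bool, detected: bool, output, golden_output) -> str:
    """Classify a single fault-injection experiment."""
    if detected and not crashed and output == golden_output:
        return "CE"       # detected and corrected: correctable error
    if crashed or detected:
        return "DUE"      # detected but not recovered (e.g. a crash)
    if output != golden_output:
        return "SDC"      # wrong output, nothing noticed: silent data corruption
    return "masked"       # the fault had no architecturally visible effect

assert classify_run(False, False, [1, 2, 3], [1, 2, 3]) == "masked"
assert classify_run(False, False, [1, 9, 3], [1, 2, 3]) == "SDC"
assert classify_run(True, True, None, [1, 2, 3]) == "DUE"
assert classify_run(False, True, [1, 2, 3], [1, 2, 3]) == "CE"
```

The fraction of runs that land outside "masked" is what the fault-injection estimate of AVF is built from.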

2.2 Transient Fault - Background and Terminology

To deal with transient faults, microprocessor vendors often establish an error budget for each design. Designers then perform extensive analysis to ensure that the design meets these target budgets. Vendors express their error budgets in terms of Mean Time Between Failures (MTBF). For example, for its Power4 processor-based systems, IBM targets a 1,000-year system MTBF for SDC errors, a 25-year system MTBF for DUE errors that lead to a system crash, and a 10-year system MTBF for DUE errors that lead to an application crash [16].


Another commonly used measurement unit for error rates is Failures in Time (FIT), which is inversely related to MTBF: one FIT corresponds to one failure in a billion (10^9) hours of operation. Therefore, a 1,000-year MTBF equals 114 FIT (10^9/(1,000 × 365 × 24) ≈ 114). Similarly, zero FIT means that there is infinite time between failures (infinite MTBF). Designers prefer to work with FIT rather than MTBF because FIT, unlike MTBF, is an additive unit.
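The MTBF/FIT conversion described above can be sketched in a few lines; the helper names below are ours, not from the thesis:

```python
HOURS_PER_YEAR = 365 * 24

def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert an MTBF given in years to FIT (failures per 10^9 device-hours)."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)

def fit_to_mtbf_years(fit: float) -> float:
    """Inverse conversion: a FIT rate back to an MTBF in years."""
    return 1e9 / (fit * HOURS_PER_YEAR)

# IBM's Power4 SDC target quoted in the text: a 1,000-year MTBF is ~114 FIT.
print(round(mtbf_years_to_fit(1000)))  # 114
```

The same two lines also confirm the text's boundary case: as FIT approaches zero, the MTBF returned by `fit_to_mtbf_years` grows without bound.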

To evaluate whether a chip meets its FIT target, designers use sophisticated computer

models. The effective FIT rate of a structure in a chip is the product of two metrics, its raw circuit

FIT rate and its vulnerability factor.

2.2.1 Raw Circuit FIT Rate

The raw circuit FIT rate of a cell, also called its intrinsic FIT rate, is its device-level transient fault rate, including any extra derating such as that which may be necessary for a dynamic cell.

2.2.2 Vulnerability Factor

The vulnerability factor (also called the derating factor or soft error sensitivity factor) indicates the probability that an internal fault in a device will result in an externally-visible error.

Several vulnerability factors affect the FIT rate of a structure. The timing vulnerability factor, for example, measures the percentage of time during which a fault in a structure will lead to an externally-visible error. A strike on the stored bit of a level-sensitive latch may not cause an external error if it occurs while the latch is accepting data, because the stored bit will be overwritten by the incoming data. If the latch accepts data 50% of the time, its timing vulnerability factor is 50%.

Several other vulnerability factors affect the effective FIT rate of a structure. In this work, however, we assume that all vulnerability factors except the architectural vulnerability factor are incorporated into the raw circuit FIT rate.
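Under the assumption just stated (every vulnerability factor except AVF folded into the raw rate), the effective FIT rate of a structure, and of a whole chip, can be sketched as follows; the structure names and numbers are made up for illustration:

```python
def effective_fit(raw_circuit_fit: float, avf: float) -> float:
    # Effective FIT = raw circuit FIT rate x architectural vulnerability factor,
    # assuming all other vulnerability factors are already in the raw rate.
    return raw_circuit_fit * avf

# Hypothetical per-structure numbers, (raw FIT, AVF). Because FIT is additive,
# the chip-level FIT budget is a plain sum over its structures.
structures = {
    "register_file": (200.0, 0.05),
    "local_data_share": (80.0, 0.15),
    "instruction_queue": (40.0, 0.30),
}
chip_fit = sum(effective_fit(raw, avf) for raw, avf in structures.values())
print(chip_fit)  # 10 + 12 + 12 = 34.0 FIT
```

The additivity is exactly why designers prefer FIT over MTBF: per-structure budgets can be summed directly, whereas MTBF values cannot.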

2.3 Architectural Vulnerability Factor

The Architectural Vulnerability Factor of a structure is defined as the percentage of bits in

the structure that are necessary for correct program execution over the lifetime of a simulation. It

expresses the probability that a bit flip in the structure will produce a visible incorrect result at the

output of a program.

Current predictions show that the overall raw FIT rate per bit will remain constant for

the next several technology generations. Therefore, it is crucial to focus efforts on reducing the

architectural vulnerability factor of a chip to make the chip more reliable and competitive.

There are two common methodologies for assessing AVF in silicon devices: fault injection and ACE analysis [17][18][9]. These two methods help designers analyze the AVF of an architecture at various stages of the design process.

2.3.1 Fault Injection

Fault injection is the most widespread method for assessing reliability. A fault injection

campaign compares the reference behavior of the circuit for a given workload (that is, the correct

behavior validated by the designer) with the behavior obtained in the presence of each fault in a

predetermined set [19].

In a fault injection campaign, a fault is injected into a hardware structure at a random time and at a random location while a workload is being executed on the device under test. The output of the workload is then examined against a golden output to determine whether the injected fault caused a visible failure. This process is repeated many times; as the number of runs becomes statistically significant, the ratio of failing runs to total runs converges towards the Architectural Vulnerability Factor of the structure. Hardware-implemented fault injection and software-implemented fault injection are the most common approaches.
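The campaign loop just described can be sketched as follows, with a toy device model standing in for the simulated structure; the function names and the toy workload are our illustration, not the thesis's framework:

```python
import random

def fault_injection_avf(run_with_fault, golden_output, num_bits, num_cycles,
                        num_trials=10_000, seed=0):
    """Monte Carlo AVF estimate: fraction of injected faults that corrupt the output.

    run_with_fault(bit, cycle) re-executes the workload with a single bit flip
    at the given location and time and returns the program output.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(num_trials):
        bit = rng.randrange(num_bits)       # random fault location
        cycle = rng.randrange(num_cycles)   # random injection time
        if run_with_fault(bit, cycle) != golden_output:
            failures += 1
    return failures / num_trials            # converges to the structure's AVF

# Toy 64-bit "structure" in which only the low 16 bits ever reach the program
# output, so the true AVF is 16/64 = 0.25.
def toy_run(bit, cycle):
    state = 0xDEADBEEF ^ (1 << bit)  # inject the single bit flip
    return state & 0xFFFF            # only the low bits are externally visible

avf = fault_injection_avf(toy_run, golden_output=0xBEEF, num_bits=64, num_cycles=1000)
print(avf)  # close to the true AVF of 0.25
```

With 10,000 trials the estimate lands within a couple of percentage points of the true value, which is the convergence behavior the text appeals to.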

2.3.1.1 Hardware-implemented Fault Injection

In this method, faults are inserted into the actual device silicon, either using dedicated custom hardware [20] or by injecting faults into integrated circuits using heavy-ion radiation [21].

Because hardware fault injection is performed on actual hardware, there is no need to know the internal details of the device, and it closely mimics what happens in real systems. It is therefore very accurate; the effects of the operating system, the latency of I/O operations, and other non-deterministic effects are already taken into account. Furthermore, since injections are performed on the actual hardware running the workloads, a fault injection campaign takes significantly less time than software-implemented fault injection.

However, the disadvantages of a hardware fault injection campaign make it difficult to adopt. First, hardware fault injection must be done post-silicon, since we need at least a hardware

prototype. This is usually too late, considering that such reliability analysis is often needed during the architectural exploration phase of a design; the results can, however, still inform reliability decisions for future devices that use a similar technology or architecture. Second, it is very expensive and time-consuming to build dedicated custom hardware or to subject a chip to an electron beam.

2.3.1.2 Software-implemented Fault Injection

In Software Fault Injection, faults are injected in the simulated hardware under test.

Because this is done in a software implementation of the hardware, it can be done in a performance

simulator which is usually available during the architectural exploration phases of a microprocessor

design project. Therefore, the results of a software-implemented fault injection campaign can be used

to influence the design of a new chip. Moreover, since we are using a software implementation of

the hardware, we naturally have more visibility into the internals of the architecture under test. One

drawback of the software fault injection method is that simulation tends to be very slow compared to

the execution of a workload on the native hardware.

2.3.2 ACE Analysis

ACE analysis was first developed by Mukherjee et al. [9] to calculate the Architectural Vulnerability Factor (AVF) of pipeline structures such as the instruction queue and the reorder buffer. Traditional ACE analysis is implemented in simulation and determines the AVF of hardware structures in a single pass through a program.

In ACE analysis, the AVF of hardware structures is estimated by tracking the hardware

state bits that are required for Architecturally Correct Execution (ACE). If any fault occurs in a

storage cell containing these ACE bits, and if there is no error correction technique present on the

system, there will be a visible error in the output of the program. The remaining state bits that are

not ACE are called un-ACE bits; they are not required for architecturally correct execution of the

program and a fault in a storage cell containing an un-ACE bit will not cause a visible error at the

output of the program.

The AVF of a single-bit storage cell is the fraction of time it holds an ACE bit. Consequently, the AVF of a hardware structure is the average AVF of its storage cells. ACE analysis of a structure starts by conservatively assuming that all bits in the structure are ACE bits, then proceeds to identify bits that can be marked as un-ACE. Un-ACE bits can be categorized as either

architectural or microarchitectural un-ACE bits. Examples of architectural un-ACE bits include

bits from NOP instructions, performance-enhancing instructions (e.g., prefetches), predicated-false

instructions, dynamically-dead code, and logical masking. Examples of microarchitectural un-ACE

bits are idle or invalid bits, mis-speculated bits (wrong-path instructions or predictor structure bits),

and microarchitecturally dead bits.
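The bookkeeping behind this definition can be sketched as follows; the interval-based liveness model is our simplification, not Mukherjee et al.'s actual implementation:

```python
def ace_avf(ace_intervals, num_bits, total_cycles):
    """AVF of a structure = average fraction of cycles each bit holds an ACE bit.

    ace_intervals maps a bit index to a list of (start, end) cycle intervals
    during which that bit is ACE; bits absent from the map are un-ACE for the
    whole run (idle, invalid, dynamically dead, ...).
    """
    ace_bit_cycles = sum(
        end - start
        for intervals in ace_intervals.values()
        for start, end in intervals
    )
    return ace_bit_cycles / (num_bits * total_cycles)

# 4-bit structure over 100 cycles: bit 0 is ACE for 50 cycles, bit 1 for 25,
# bits 2-3 never -> AVF = (50 + 25) / 400 = 0.1875.
print(ace_avf({0: [(0, 50)], 1: [(10, 35)]}, num_bits=4, total_cycles=100))
```

Note how the conservative starting point of ACE analysis corresponds to initially treating every bit as ACE over every cycle, i.e., an AVF of 1.0 that only decreases as bits are proven un-ACE.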

Because ACE analysis is conservative by construction, the AVF value it produces can often be too pessimistic. It has been shown that even a refined ACE analysis can overestimate the error vulnerability of a structure by 2-3x [22]. This can result in overprotection of the structure, which makes a processor uncompetitive. Furthermore, although ACE analysis gives more insight into the resilience of a structure, performing it on certain structures can be a very involved process.

Chapter 3

Transient Faults on GPUs

In this chapter, we review previous studies on the effects of transient faults on GPUs. We

also consider the tools that have been developed to study these effects. The previous studies can be

categorized into fault injection studies and ACE analysis studies.

3.1 Fault Injection Studies

Fault injection evaluates the impact of introducing a fault into the execution of a program. Faults can be injected into live hardware (typically through radiation-beam testing or a software-based injector) or into a simulated microprocessor or memory system.

3.1.1 GPU-Qin

GPU-Qin [23] is a fault injection tool for GPUs. The tool is built to perform fault injection

studies on real GPUs running CUDA-based applications. It uses CUDA-GDB, the NVIDIA tool for

debugging GPU applications. The applications are first profiled, and then instructions are selected as

fault injection sites. At runtime, GPU-Qin injects a fault into the selected instructions.

The results of a fault injection campaign with GPU-Qin [23] showed that some applications inherently possess a degree of resilience to transient faults, which should be taken into account when protecting an application against them. In addition, there was wide variation in the rates of Silent Data Corruptions (SDCs) and crashes across the studied applications. However, benchmarks with similar behaviors showed similar vulnerability; for example, HashGPU-sha1 and HashGPU-md5 are the SHA1 and MD5 hash implementations, respectively, from StoreGPU [24], a library that accelerates a number of hash-based primitives.

The variability in SDC rates across GPU applications is mainly related to the applications' characteristics. For example, applications based on search algorithms are likely to have lower SDC rates than applications that perform computations such as linear algebra, because a fault that affects a part of the search that will not lead to a match is unlikely to produce an incorrect result. Applications based on "average-out" algorithms, such as stencil codes [25], also have a low SDC rate.

These applications have computations in which the final state is a product or average of

multiple temporary states. Because the final state will be an average or product of temporary states, a

fault affecting an intermediate state is likely to be masked and unlikely to affect the final state.

Another focus of the GPU-Qin fault injection study was a technique to cluster applications into five resilience categories based on their SDC rates. Because of the variability in SDC rates across applications, and the similarity in resilience among similar algorithms, the GPU-Qin authors found it useful to categorize the applications based on their SDC rates and the operations they perform. From this clustering, each of the resilience categories matched well with one or more of the dwarves defined by Asanovic et al. [26].

3.1.2 SASSIFI

SASSIFI [27] is another fault injection tool for NVIDIA GPUs. SASSIFI is based on SASSI, a low-level, compiler-based assembly-language instrumentation framework that allows the injection of code at specific points in a program [28]. SASSIFI injects faults into the destination values of a running program's executing instructions at the architectural level. This allows for faster fault injection, increased visibility into the applications, and the possibility of a detailed study and analysis of the magnitude of Silent Data Corruptions (SDCs).

SASSIFI gives the user the ability to trace an SDC all the way back to the specific fault that produced it, and to correlate program properties with program vulnerabilities, which is key to developing low-cost error mitigation schemes. Because SASSIFI injects faults at the architecture level (as opposed to the microarchitecture level), fault injection experiments with SASSIFI can only measure the derating that happens at the application level.

A fault injection study using SASSIFI and the Rodinia applications evaluated the variation

in the SDC rate of these applications. Further analysis was done on the injected faults that caused

different outcomes and it was observed that fault injection outcomes vary with different kernels of

the same program, and with different invocations of the same kernel.

SASSIFI is similar to GPU-Qin in that it injects single-bit faults into the destination values of a program's executing instructions. One key difference is that SASSIFI can inject faults into control and predicate registers. A drawback shared by these tools is that, because the instrumentation instructions must run on the GPU, code injections may perturb the workload running on the GPU and alter its behavior, which can lead to inaccuracy in the reported reliability results.

Finally, it is worth mentioning that SASSIFI and GPU-Qin inject faults at a level above the

microarchitecture. This means that the derating that comes from hardware structures is not taken

into account in the results. Furthermore, the location for an injection is carefully selected in order to

reduce the population size and easily attain statistical significance.

3.1.3 Multi2Sim Fault Injection [1]

In this thesis we utilize the Multi2Sim simulation framework [29] to provide the basis

for our simulation model of the AMD 5870. We leverage prior work by Farazmand et al. [1] to

build our fault injection framework in Multi2Sim [29]. Multi2Sim is a simulation framework for

CPU-GPU heterogeneous computing. We provide a more detailed description on the environment

and infrastructure in Chapter 4.

In the fault injection campaign of Farazmand et al. [1], faults are injected into structures of the microarchitecture. The results of this campaign show that a great number of

resources are not utilized by the GPU, especially for the small applications that were used. This

results in a very low rate of Silent Data Corruptions and crashes. For the injections in utilized

resources, the GPU demonstrated high resilience, and in many cases, the applications were able to

run to completion without any error in their output.

There were a few interesting implications from this fault injection study. The authors

observed that structures with similar functionality in the CPU and GPU were not necessarily similar

in terms of their vulnerability. In addition, given that very few injections into the register file led to

an error in the program outputs, it makes little sense to dedicate significant resources to protect this

structure. This prior work is the only other fault injection campaign that tries to compute the AVF

values for structures of the GPU.

3.2 ACE Analysis Studies

3.2.1 GPGPU-SODA

GPGPU-SODA [30] is a framework for evaluating the vulnerability of a GPU to transient faults. It is built on GPGPU-Sim, a cycle-accurate, open-source, publicly available simulator. GPGPU-SODA estimates the vulnerability of the major microarchitectural structures in a streaming multiprocessor using ACE analysis, characterizing the vulnerability of the different microarchitectural structures of a GPU to transient faults through architectural vulnerability factor (AVF) analysis.

Tan et al.'s study with GPGPU-SODA found that the GPU microarchitecture's vulnerability is highly related to workload characteristics such as the percentage of un-ACE instructions, the per-block resource requirements, and the degree of branch divergence. They also concluded that several structures are highly susceptible to transient faults, and that the entire GPU should be considered for protection.

Chapter 4

Methodology

As discussed in Chapter 2, fault injection can be performed on real or simulated hardware. GPU-Qin and SASSIFI, for example, inject faults into the physical GPUs under test. Fault injection in real hardware yields very accurate results and can be significantly faster than the alternative.

However, simulation-based fault injection is useful in that it can help GPU designers explore potential design decisions. The simulator described in this chapter is built for architectural exploration: our fault injection mechanism, which is built into Multi2Sim [29], can be used to carry out fault injection experiments while exploring different microarchitectural, compiler, and runtime tradeoffs. In this chapter, we describe the framework that we used to perform the fault injection campaign, as well as the post-experiment analyses.

4.1 Multi2Sim simulation model

We performed our fault injection campaign in an architectural simulator, Multi2Sim [29].

Multi2Sim is an open-source, modular and fully configurable simulation framework for CPU-GPU

computing. Multi2Sim provides a wide range of CPU and GPU choices. The specific framework used

in this thesis leverages a model of the AMD Evergreen family of GPUs. The Evergreen Instruction Set Architecture has been used in the implementation of AMD's mainstream Radeon™ 5000 and 6000 series of GPUs. The model implemented in Multi2Sim is similar to the Radeon™ 5870 GPU.

Multi2Sim supports both functional simulation and architectural (or detailed) simulation

for the Evergreen family of GPUs. Functional simulation provides traces of Evergreen instructions;

architectural or detailed simulation tracks the execution time and architectural state of the GPU

hardware structures. Simulation of a program in the Evergreen model begins with the execution of the host (CPU) code of the OpenCL program, which runs on the CPU simulation module of Multi2Sim. The OpenCL API calls made by the host program are intercepted and used to kick off the GPU simulation.

4.2 Multi2Sim for Fault Injection

The authors of Multi2Sim have introduced the ability to perform fault injection into the

execution of a program [31]. Using Multi2Sim, we are able to assess the reliability of individual

hardware structures. We can inject faults during any cycle of the runtime of any hardware structure

that is modelled by Multi2Sim. The fault injection mechanism is not specific to the Evergreen model,

and therefore can be implemented and applied to different GPU architectures in Multi2Sim. In

addition to the study presented in Chapter 3, the fault injection mechanism in Multi2Sim has been

extended and used in other fault injection studies [32] [33] [1].

When using Multi2Sim to inject a fault in a hardware structure during a simulation, a fault

definition file is fed to the simulator. This fault definition file contains the following information:

a) the targeted hardware structure, b) the specific fault location, and c) the injection time. The fault

location is the position of the bit within the hardware structure where the fault should be injected, and the injection time is the simulation cycle at which the injection is performed.
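Concretely, a fault definition entry might look like the following; the field names and syntax here are purely illustrative assumptions, not Multi2Sim's actual file format:

```
# Hypothetical fault-definition entry (illustrative syntax only)
structure = vector-register-file   # (a) targeted hardware structure
bit       = 4096                   # (b) fault location: bit position in the structure
cycle     = 1250000                # (c) injection time: simulation cycle
```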

Figure 4.1: Possible outcomes for each simulation run

A fault is represented by a bit flip in the simulated hardware structure at the specified cycle and location. The faulty value is either propagated to other locations in the simulation model or masked by the program. The programs used in our experiments have a self-check mechanism that compares the output of the GPU program to a precalculated golden reference output. The possible outcomes of a single simulation with a fault injection are presented in Figure 4.1.

For each structure and each application, a total of 10,000 single faults are injected. The

statistical significance of this number of experiments is discussed in Section 4.3. In order to calculate

the Architectural Vulnerability Factor, we compute the number of fault injections that result in a

program failure (SDCs) and divide by the total number of faults injected.

4.3 Statistical Significance

Figure 4.2: This graph shows how the AVF value changes based on the number of fault-injection experiments. We notice that the AVF value shows little variation and stabilizes after 5,000 injections.

Our goal is to statistically estimate the AVF of the workloads running on an AMD Evergreen GPU. We want to choose a number of simulations that is large enough for our results to achieve statistical significance, but not so large that performing the experiments becomes burdensome.

To determine an appropriate number of injections, we computed the AVF of the structures with a varying number of injections. The results of this experiment are shown in Figure 4.2. We found that the AVF value varies significantly when the number of injections is small; however, once the number of injections passes 5,000, there is little variance in the AVF value.

Figure 4.3: Formula for the number of faults to select for injection

We also verify that the number of faults injected into every structure is statistically significant, using the methodology presented by Leveugle et al. [19]. According to their methodology, given a confidence level, the sample size n, i.e., the number of faults to randomly select for injection, can be computed with the formula in Figure 4.3. The variables in this formula are:

• N: initial population size. This is the number of all potential injection sites.

• p: estimated probability of a fault resulting in an error. The authors demonstrated that p = 0.5 is a sufficient (worst-case) value to use.

• e: margin of error. This is the most sensitive parameter in the formula; reducing it increases the sample size very quickly. We chose 0.005 as our margin of error.

• t: cut-off point corresponding to the confidence level. We chose a 95% confidence level.
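Since Figure 4.3 is not reproduced here, the computation can be sketched with the standard finite-population sample-size formula consistent with the variables above, n = N / (1 + e²(N−1)/(t²·p·(1−p))), where t is the normal-distribution cut-off for the chosen confidence level (t ≈ 1.96 for 95%); treating this as Leveugle et al.'s formula is our assumption:

```python
import math

def sample_size(N: float, p: float = 0.5, e: float = 0.005, t: float = 1.96) -> int:
    """n = N / (1 + e^2 * (N - 1) / (t^2 * p * (1 - p)))

    N: initial population size (all potential injection sites)
    p: estimated probability a fault becomes an error (0.5 is worst case)
    e: margin of error; t: cut-off point for the confidence level
    """
    n = N / (1 + e * e * (N - 1) / (t * t * p * (1 - p)))
    return math.ceil(n)

# With a large population the finite-population correction vanishes and n
# approaches t^2 * p * (1 - p) / e^2.
print(sample_size(10**6, e=0.01))  # 9513 injections for a 1% margin of error
```

Note that a smaller e dominates the result: n grows roughly with 1/e², which is why the text calls it the most sensitive parameter.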

Using this formula, we computed the required sample size for the vector register file of the GPU while running the ScanLargeArrays workload, which runs for just under 7.5 million cycles. To guarantee a margin of error of less than 0.5%, we would be required to inject 9,800 faults. Given that most of our applications run for less than 7.5 million cycles, and that the number of potential injection sites in local memory is far smaller than that of the register file, the initial population size is highest in the case of ScanLargeArrays. This case therefore yields the highest margin of error, and with the 10,000 injections used in this thesis, our results have a margin of error of less than 0.5%.

4.4 Post-Experiment Analysis

Fault-injection campaigns very often treat the system under evaluation as a black box: after the campaign, researchers report a single number that measures the resilience of the system. Our goal, by contrast, is primarily to offer insights to GPU application developers into

the vulnerability of their applications. The data obtained from our fault injection campaign allows us to perform further analysis of the injected faults throughout the execution of the programs. Because we record each fault and its outcome at the end of the simulation, we can perform a per-time-interval vulnerability study of the applications; that is, we can track the vulnerability of an application over the course of its execution. We hope that these results, coupled with the application profiles, will be helpful to programmers when evaluating the vulnerability of their applications.
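The per-time-interval analysis described above can be sketched as follows; the record format is our assumption about what the campaign logs, not the thesis's actual data layout:

```python
from collections import defaultdict

def interval_avf(fault_records, interval_cycles):
    """Per-interval vulnerability from a completed fault-injection campaign.

    fault_records: iterable of (injection_cycle, caused_failure) pairs, one per
    run. Returns {interval_index: fraction of faults in that interval that
    caused a visible failure}, i.e., a per-interval AVF estimate.
    """
    injected = defaultdict(int)
    failed = defaultdict(int)
    for cycle, caused_failure in fault_records:
        k = cycle // interval_cycles   # which interval this injection fell into
        injected[k] += 1
        failed[k] += caused_failure
    return {k: failed[k] / injected[k] for k in sorted(injected)}

# Toy campaign: faults injected early in the run are masked, mid-run ones are not.
records = [(100, False), (900, False), (1100, True), (1900, True), (2500, False)]
print(interval_avf(records, interval_cycles=1000))  # {0: 0.0, 1: 1.0, 2: 0.0}
```

Peaks in this per-interval curve are the points a programmer would then try to correlate with architectural events such as heavy register or LDS occupancy.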

4.5 Evaluation Framework

Number of compute units: 1
Number of stream cores: 16
Number of vector registers: 16,384
Number of memory banks: 32

Table 4.1: The GPU configuration used in the experiments

4.5.1 Platform for Evaluation

The configuration of the GPU model used in the experiments is presented in Table 4.1. We have used the AMD Radeon™ 5870 as the base configuration. An overview of the Evergreen architecture is provided in Figure 1.2.

Because our applications are small compared to the workloads that a GPU can potentially run, we have used only one compute unit in order to maximize the occupancy of the GPU, so that faults are injected into resources the GPU actually uses. We argue that focusing on the reliability of a single CU should not significantly impact the fidelity of our AVF values.

4.5.2 Evaluated Benchmarks

The applications are taken from the AMD APP SDK [34] and are common general-purpose GPU applications, chosen because they provide a representative cross-section of common GPU workloads. The list of selected applications is shown in Table 4.2.

BNSRCH (BinarySearch): Finds the position of a given element in a sorted array. Instead of dividing the search space in two at every pass, it is divided into N segments (N-ary search); computational complexity is log base N.

BSRT (BitonicSort): A sorting network of n·log²(n) comparators. Performs best when sorting a small number of elements.

DCT: The Discrete Cosine Transform, a common transform for compression of 1D and 2D signals such as audio, images, and video.

HGRM (Histogram): Calculates the histogram of an array.

MMUL (MatrixMultiplication): Performs the multiplication of two matrices.

MTRNS (MatrixTranspose): Matrix transpose optimized to coalesce accesses to shared memory and avoid bank conflicts.

PSUM (PrefixSum): Computes an array holding the running totals of the elements of the input array.

RDXS (RadixSort): A radix-based sorting algorithm that treats keys as multi-digit numbers in which each digit is an integer with a value ranging from 0 to m, where m is the radix.

SLA (ScanLargeArrays): A scan based on a prefix sum, performed block-wise; the blocks are then combined into a single result array.

Table 4.2: The benchmarks used in the experiments

Chapter 5

Results and Analysis

Multiple factors can affect the vulnerability of a structure. Vulnerability can differ based on the kind of computation a program performs, whether or not the application can tolerate approximate output values, and the level of occupancy of the structure where the fault was injected. The microarchitecture can also mask faults; specifically, an identical fault injected in an alternative microarchitecture can lead to very different outcomes.

In this thesis, we will look at the liveness of a structure and examine how a programmer can reduce the vulnerability of a structure by reducing its liveness.

5.1 Local Data Storage

The results derived from our fault injection study in the Local Data Share are presented

in Figure 5.1. From these results, we observe that some applications do not make use of the LDS

and therefore show no vulnerability. Other applications either partially or fully use the Local Data

Storage (LDS) and show a wide range of vulnerability. Figure 5.2 shows the maximum amount of

local memory used by each application.

Some applications, notably BitonicSort and BinarySearch, do not use the Local Data Storage in their algorithms; there is no apparent sharing between work-items within a work-group. Therefore, any fault injected into the LDS will not have an impact on the visible output of the program.

Because the locations for the bit flips were randomly chosen, and because most of the applications are small in nature and make minimal use of the LDS, many of the flipped bits landed in unallocated portions of the LDS. This can be seen in applications that have a high percentage of

Figure 5.1: Results of Fault injection experiments on the Local Data Share

Figure 5.2: Amount of local memory used by each application for each NDRange mapped to the compute device

NoFaults (benign faults). For example, in PrefixSum, DCT and ScanLargeArrays, a large number

of the faults were injected into non-utilized portions of the LDS.

Other applications, such as Histogram, MatrixMultiplication, RadixSort and

ScanLargeArrays, make heavy use of the LDS. However, the failure rates of these applications vary widely.

Histogram exhibits very little resilience to faults, while MatrixMultiplication is highly fault-tolerant.

When running MatrixMultiplication, the probability that a fault in the LDS leads to a

program-visible error in the output matrix is around 1%. This suggests that the inherent resilience of

an application should be taken into account when protecting it from transient faults.
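The per-structure vulnerability numbers above can be computed directly from the campaign logs. The sketch below is a minimal illustration, not the simulator's actual tooling; the outcome labels (`"sdc"`, `"crash"`, `"benign"`) are hypothetical names for the per-injection results.

```python
from collections import Counter

def estimate_avf(outcomes):
    """Estimate the AVF of a structure as the fraction of injected faults
    that produced a program-visible error (silent data corruption or crash).
    'outcomes' is a list of per-injection labels; the label names used here
    are illustrative, not the simulator's actual output."""
    counts = Counter(outcomes)
    visible = counts["sdc"] + counts["crash"]
    return visible / len(outcomes)

# e.g. a campaign where 1 of 100 faults corrupted the visible output,
# matching the ~1% observed for MatrixMultiplication
outcomes = ["sdc"] + ["benign"] * 99
print(estimate_avf(outcomes))  # 0.01
```

Whether crashes should count toward the AVF alongside silent data corruptions is a design choice; the fraction above counts both as program-visible failures.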

5.2 Register File

Figure 5.3: Results of Fault injection experiments on the General Purpose Register File

The results of the fault injection campaign in the register file of the GPU are shown

Figure 5.4: Number of General-Purpose registers used by each application for each NDRange mapped to the compute device

in Figure 5.3. These results show that the register file of the GPU is a highly resilient structure. A

large portion of the activated faults are masked and thus do not affect the output of the program.

Figure 5.4 shows the maximum number of registers used for each NDRange mapped to a

compute device per benchmark.

5.3 How does vulnerability vary over time?

To understand the inherent resilience of an application, we examine how the vulnerability of

an application varies over time. Our goal is to identify the points in time where the application is most

vulnerable (or least vulnerable) and match them to specific architectural events that can be under

the control of a developer. This should provide insights to a programmer or application designer on

how to make their applications more fault-tolerant.

To identify the points in time where an application is most vulnerable, we divide the application's

simulation time into cycle intervals. We then examine all the faults that were injected during each interval

and their outcomes. We also verify that the number of faults injected into each interval is statistically

significant, using the formula from Leveugle et al. [19] presented in Section 4.3. We estimate that the number

of faults needed to obtain a 95% confidence interval in our results is about 90. Because the faults are

uniformly distributed over time, we are confident that our results are statistically meaningful. We

made some interesting findings which can be beneficial to application designers. We cannot guarantee

that the observed trends will appear in every application; however, they can offer some guidelines

when designing fault-tolerant applications. We present our findings in the case studies below.
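That statistical check can be sketched as follows using the sample-size formula from Leveugle et al. [19]. The parameter values here are the common defaults (a 1.96 cutoff for 95% confidence, p = 0.5 as the conservative worst case, and a 10% margin of error); the exact values used in Section 4.3 may differ.

```python
import math

def required_faults(N, e=0.1, t=1.96, p=0.5):
    """Number of fault injections needed per interval, per Leveugle et al. [19].
    N: fault population size (fault sites x cycles in the interval)
    e: margin of error
    t: confidence cutoff (1.96 for a 95% confidence level)
    p: estimated failure proportion (0.5 is the conservative worst case)"""
    return math.ceil(N / (1 + e**2 * (N - 1) / (t**2 * p * (1 - p))))

# For a large population the result approaches t^2 * p * (1-p) / e^2,
# i.e. on the order of 100 faults per interval, which is consistent with
# the ~90 faults estimated above.
print(required_faults(10**9))
```

The key property is that for large N the required sample size saturates rather than growing with the population, which is what makes a per-interval campaign tractable.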

5.3.1 Case Study: LDS - RadixSort

The LDS-AVF variation over time for RadixSort is shown in Figure 5.5. This graph shows

that midway through the running of RadixSort, the application is highly vulnerable. Almost any fault

in these intervals will lead to an incorrect result. We conduct some analysis to examine the reasons

behind the high vulnerability of this application.

The RadixSort application has three phases: a histogram, which runs on the GPU; a scan of the

generated histogram, which runs on the CPU; and a permute, which runs again on the GPU. Figure 5.6

shows the in-flight accesses to the LDS during the course of the running of the application. The

Figure 5.5: Intervals of Vulnerability for RadixSort. This shows that the faults that lead to incorrect output fall only into specific intervals of time

Figure 5.6: Intervals for LDS accesses for RadixSort.

application loops four times, each time operating on a different part of the 32-bit integers, starting

with the least significant.

We see that the first iteration is not vulnerable at all, because this iteration operates

on the low-order bits of the integers. In the other iterations, the histogram kernel is the most vulnerable,

with the third iteration being almost 100% vulnerable.
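To make the per-iteration behavior concrete, a four-pass radix sort of 32-bit integers typically histograms one 8-bit digit per pass, starting from the least-significant byte. The 8-bit digit width is an assumption here for illustration; the actual SDK kernel may use a different radix.

```python
def digit(value, pass_index, bits=8):
    """Extract the digit that the given radix-sort pass histograms.
    Pass 0 operates on the least-significant bits of each key; later
    passes shift progressively higher bits into view."""
    mask = (1 << bits) - 1
    return (value >> (pass_index * bits)) & mask

x = 0x12345678
print([hex(digit(x, i)) for i in range(4)])  # ['0x78', '0x56', '0x34', '0x12']
```

Each pass thus builds its histogram from a different byte of the keys, which is why the vulnerability of the LDS differs from one iteration to the next.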

5.3.2 Case Study: MatrixMultiplication

Figure 5.7: Intervals of Vulnerability for MatrixMultiplication. The vulnerability of MatrixMultiplication shows a periodic behavior

The LDS-AVF time variation for MatrixMultiplication is shown in Figure 5.7. This figure

shows that the vulnerability of the LDS with MatrixMultiplication presents a periodic behavior with

some short intervals of no vulnerability. This behavior is explained by the fact that the application

reads memory locations in chunks and rewrites other locations. In Figure 5.8, we show how

the accesses to the LDS vary over time. The local memory accesses also show a periodic behavior,

Figure 5.8: Local Memory accesses in MatrixMultiplication show a periodic behavior

with brief bursts of writes followed by longer periods of memory reads. We observe that

the intervals of low LDS vulnerability fall right before the memory writes. This is because

the application has finished using the values stored in memory just before it initiates the writes

that overwrite those memory locations. Therefore, a fault in an LDS location right before the

location gets overwritten will not affect the result of MatrixMultiplication.
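This masking condition can be stated as a simple predicate: a flipped bit is benign if the flip lands after the last read of its location and before the write that next overwrites it. The cycle numbers below are illustrative, not measured values.

```python
def fault_is_masked(fault_cycle, last_read_cycle, next_write_cycle):
    """A single-bit fault in an LDS location is masked (a dead value) if it
    occurs after the final read of the current value and before the write
    that overwrites the location; faults outside that window can reach the
    output."""
    return last_read_cycle < fault_cycle < next_write_cycle

# A fault at cycle 1500, after the last read at 1400 and before the
# overwriting write at 1600, cannot reach the output matrix.
print(fault_is_masked(1500, 1400, 1600))  # True
print(fault_is_masked(1300, 1400, 1600))  # False: the value is still live
```

Shrinking this write-to-last-read window, i.e. reducing the liveness of the data, is precisely what lowers the LDS vulnerability of an application.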

Chapter 6

Conclusion

In this thesis, we conducted a thorough characterization of the effects of particle-induced

errors in the vector register file and the local data share of a GPU from the AMD Evergreen family

of GPUs. Our study shows that the vulnerability of common GPU applications varies widely

depending on the implementation of the application.

We also identified a few trends that can be exploited by application developers to

make their applications more robust. We observed that some applications show periodic behavior

in their vulnerability and are most vulnerable at specific intervals during their execution. Overall, the

longer and the more numerous the intervals of vulnerability, the more vulnerable the

application is. One general rule of thumb for a highly resilient application is to reduce the liveness of

useful data, i.e., reduce the time between writing to a structure and reading the written data. Also,

because caches are more likely to be protected, a GPU application developer seeking to make an

application more reliable should consider storing data in global memory, trading off some

performance for better reliability.

In future work, we plan on developing a more comprehensive vulnerability analysis

framework. A fault injection campaign is a very lengthy process and provides very little detail with

regard to the sources of masking in a hardware structure. A comprehensive reliability framework will

help us identify the sources of vulnerability and provide better insights to both hardware designers

and application developers.

Bibliography

[1] N. Farazmand, R. Ubal, and D. Kaeli, “Statistical fault injection-based analysis of a GPU

architecture,” in Workshop on Silicon Errors in Logic - System Effects (SELSE), 2012.

[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software

Interface (The Morgan Kaufmann Series in Computer Architecture and Design), 4th ed.

San Francisco, CA, USA: Morgan Kaufmann Publishers

Inc., 2008.

[3] G. E. Moore, “Cramming more components onto integrated circuits,” Proceedings of the IEEE,

vol. 86, no. 1, pp. 82–85, Jan 1998.

[4] E. Alerstam, T. Svensson, and S. Andersson-Engels, “Parallel computing with graphics process-

ing units for high-speed monte carlo simulation of photon migration,” Journal of biomedical

optics, vol. 13, p. 060504, 2008.

[5] M. C. Schatz, C. Trapnell, A. L. Delcher, and A. Varshney, “High-throughput sequence

alignment using graphics processing units,” BMC Bioinformatics, vol. 8, no. 1, pp. 1–10, 2007.

[Online]. Available: http://dx.doi.org/10.1186/1471-2105-8-474

[6] J. E. Stone, J. C. Philips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten, “Accel-

erating molecular modeling applications with graphics processors,” Journal of Computational

Chemistry, vol. 28, pp. 2618–2640, 2007.

[7] S. Grauer-Gray, W. Killian, R. Searles, and J. Cavazos, “Accelerating financial applications on

the gpu,” in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics

Processing Units, ser. GPGPU-6. New York, NY, USA: ACM, 2013, pp. 127–136. [Online].

Available: http://doi.acm.org/10.1145/2458523.2458536

[8] J. W. Sheaffer, D. P. Luebke, and K. Skadron, “The visual vulnerability spectrum:

Characterizing architectural vulnerability for graphics hardware,” in Proceedings of

the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, ser.

GH ’06. New York, NY, USA: ACM, 2006, pp. 9–16. [Online]. Available:

http://doi.acm.org/10.1145/1283900.1283902

[9] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin, “A systematic methodology

to compute the architectural vulnerability factors for a high-performance microprocessor,”

in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International

Symposium on, Dec 2003, pp. 29–40.

[10] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczpk, and R. K. Iyer, “Nftape: a framework for

assessing dependability in distributed systems with lightweight fault injectors,” in Computer Per-

formance and Dependability Symposium, 2000. IPDS 2000. Proceedings. IEEE International,

2000, pp. 91–100.

[11] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson, “Goofi: generic object-oriented fault in-

jection tool,” in Dependable Systems and Networks, 2001. DSN 2001. International Conference

on, July 2001, pp. 83–88.

[12] E. S. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” in Supercom-

puting, ACM/IEEE 2001 Conference, Nov 2001, pp. 43–43.

[13] The Khronos Group, The OpenCL Standard. [Online]. Available: www.khronos.org/opencl

[14] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming

with cuda,” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008. [Online]. Available:

http://doi.acm.org/10.1145/1365490.1365500

[15] S. Mukherjee, Architecture Design for Soft Errors. San Francisco, CA, USA: Morgan

Kaufmann Publishers Inc., 2008.

[16] D. Bossen, “Cmos soft errors and server design,” IEEE 2002 Reliability Physics Tutorial Notes,

Reliability Fundamentals, vol. 121, pp. 07–1, 2002.

[17] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer, “Fault injection techniques and tools,” IEEE Computer,

vol. 30, no. 4, pp. 75–82, Apr. 1997.

[18] C. Constantinescu, M. Butler, and C. Weller, “Error injection-based study of soft error

propagation in amd bulldozer microprocessor module.” in DSN, R. S. Swarz, P. Koopman,

and M. Cukier, Eds. IEEE Computer Society, 2012, pp. 1–6. [Online]. Available:

http://dblp.uni-trier.de/db/conf/dsn/dsn2012.html

[19] R. Leveugle, A. Calvez, P. Maistri, and P. Vanhauwaert, “Statistical fault injection: Quantified

error and confidence,” in Design, Automation Test in Europe Conference Exhibition, 2009.

DATE ’09., April 2009, pp. 502–506.

[20] J. Arlat, Y. Crouzet, and J.-C. Laprie, “Fault injection for dependability validation of fault-

tolerant computing systems,” in Fault-Tolerant Computing, 1989. FTCS-19. Digest of Papers.,

Nineteenth International Symposium on, June 1989, pp. 348–355.

[21] J. Karlsson, P. Liden, P. Dahlgren, R. Johansson, and U. Gunneflo, “Using heavy-ion radiation

to validate fault-handling mechanisms,” Micro, IEEE, vol. 14, no. 1, pp. 8–23, Feb 1994.

[22] N. J. Wang, A. Mahesri, and S. J. Patel, “Examining ace analysis reliability estimates using

fault-injection,” in Proceedings of the 34th Annual International Symposium on Computer

Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007, pp. 460–469. [Online].

Available: http://doi.acm.org/10.1145/1250662.1250719

[23] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi, “Gpu-qin: A methodology for

evaluating the error resilience of gpgpu applications,” in Performance Analysis of Systems and

Software (ISPASS), 2014 IEEE International Symposium on, March 2014, pp. 221–230.

[24] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, “Storegpu:

Exploiting graphics processing units to accelerate distributed storage systems,” in Proceedings

of the 17th International Symposium on High Performance Distributed Computing, ser.

HPDC ’08. New York, NY, USA: ACM, 2008, pp. 165–174. [Online]. Available:

http://doi.acm.org/10.1145/1383422.1383443

[25] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and

K. Yelick, “Stencil computation optimization and auto-tuning on state-of-the-art multicore archi-

tectures,” in 2008 SC - International Conference for High Performance Computing, Networking,

Storage and Analysis, Nov 2008, pp. 1–12.

[26] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A.

Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape

of parallel computing research: A view from berkeley,” EECS Department, University

of California, Berkeley, Tech. Rep. UCB/EECS-2006-183, Dec 2006. [Online]. Available:

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

[27] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer, “SASSIFI: Evaluating

resilience of gpu applications,” in Workshop on Silicon Errors in Logic - System Effects

(SELSE), 2015.

[28] M. Stephenson, S. K. Sastry Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans,

M. O’Connor, and S. W. Keckler, “Flexible software profiling of gpu architectures,” in

Proceedings of the 42Nd Annual International Symposium on Computer Architecture,

ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 185–197. [Online]. Available:

http://doi.acm.org/10.1145/2749469.2750375

[29] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: A Simulation Framework for

CPU-GPU Computing,” in Proc. of the 21st International Conference on Parallel Architectures

and Compilation Techniques, Sep. 2012.

[30] J. Tan, N. Goswami, T. Li, and X. Fu, “Analyzing soft-error vulnerability on gpgpu microarchi-

tecture,” in Workload Characterization (IISWC), 2011 IEEE International Symposium on, Nov

2011, pp. 226–235.

[31] R. Ubal, D. Schaa, P. Mistry, X. Gong, Y. Ukidave, Z. Chen, G. Schirner, and D. Kaeli,

“Exploring the heterogeneous design space for both performance and reliability,” in Proceedings

of the 51st Annual Design Automation Conference, ser. DAC ’14. New York, NY, USA: ACM,

2014, pp. 181:1–181:6. [Online]. Available: http://doi.acm.org/10.1145/2593069.2596680

[32] F. Previlon, M. Wilkening, V. Sridharan, S. Gurumurthi, and D. Kaeli, “Examining the impact

of ace interference on multi-bit avf estimates,” Proceedings of SELSE-8: Silicon Errors in

Logic-System Effects, 2015.

[33] M. Wilkening, V. Sridharan, S. Li, F. Previlon, S. Gurumurthi, and D. R. Kaeli, “Calculating

architectural vulnerability factors for spatial multi-bit transient faults,” in Proceedings of

the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47.

Washington, DC, USA: IEEE Computer Society, 2014, pp. 293–305. [Online]. Available:

http://dx.doi.org/10.1109/MICRO.2014.15

[34] “AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK),”

http://developer.amd.com/sdks/AMDAPPSDK.
