23
FAULT TOLERANCE IN FPGA SYSTEMS Karisma Ramesh 451126715 CSE661 December 12, 2014

Fault Tolerance in an FPGA System

Embed Size (px)

Citation preview

Page 1: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS

Karisma Ramesh

451126715

CSE661

December 12, 2014

Page 2: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 1 of 22

Abstract

Field Programmable Gate Arrays (FPGAs) are of high importance in many

different applications, and represent a lot of money in the computer engineering

community. They are highly reconfigurable, and thus highly complex. FPGAs must

work reliably, and so methods of detection, diagnosis, and fault tolerance have been

found to make sure that FPGAs work when they are needed, the abundance of

configurable logic elements and routing resources in recent FPGAs provides a cost-

effective method for tolerating permanent faults in the system. Once a permanent

fault occurs, the FPGA is reconfigured by replacing the faulty part with previously

unused resources in the same hardware. Reliability and process variability are

serious issues for FPGAs in the future. Fortunately FPGAs have the ability to

reconfigure in the field and at runtime, thus providing opportunities to overcome

some of these issues. This paper provides a comprehensive survey of fault tolerance

schemes specifically for FPGAs, with the goal of laying a strong foundation for future

research in this field. All methods and schemes are qualitatively compared and some

particularly promising approaches highlighted. Methods for fault tolerance that cover

single and multi-FPGA systems, as well as for systems experiencing singular faults or

multiple faults at once are listed. In this paper, two column-based precompiled

configuration techniques are presented for tolerating permanent faults in FPGA-

based systems. By compiling alternative configuration versions in the design phase,

these approaches ensure fast reconfiguration, and thus a tremendous increase in

system availability. In addition, intentional similarities are created among different

configuration versions so that the storage overhead due to precompiled

configurations is reduced by orders of magnitude through differential coding and

run-length coding.

Page 3: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 2 of 22

Introduction

As process technology scaling continues, manufacturing large defect-free

integrated circuits will become increasingly difficult. There is also the added problem

of degradations over time after a device has been successfully deployed in the field.

These reliability issues are particularly acute with FPGAs. Similar to memories, FPGAs

have high density of Transistors and interconnect wires. A recent study suggested

that future FPGAs at and beyond the 45nm technology node will have such low yield

that defect tolerance scheme will be unavoidable in large FPGAs. FPGAs present

interesting opportunities for fault tolerance due to their ability to be reconfigured. A

fault tolerant system consists of two main components; these are fault detection and

fault repair. Design of fault-tolerant systems (FTS) is gaining popularity and

importance, because high reliability and high availability are becoming a standard

requirement. Typical application areas are the transportation, communications, on-

line process control, or multiple-access transaction processing. Fault tolerance (FT)

can be carried out by means of using spare resources to replace the faulty ones and

to take over their function. However, it mostly leads to increased area overhead and

degradation in the system performance, but also to the increase in energy

consumption, heat dissipation and weight and volume increase. Due to the current

progress in semi-conductor technology, these drawbacks are becoming less

important and are largely overweighed by the desirable functional properties of the

resulting systems. There are many approaches to FTS design that can be classified

according to several criteria. To the most important criteria belongs the level of FT

tool application. Here we can distinguish three main streams: system-level FT,

building block (intermediate) level FT and component-level FT. As the designers

mostly cannot change the internal structure of the components used, the first two

approaches are used more frequently and are therefore developed to a higher degree

of sophistication.

Page 4: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 3 of 22

FPGA Architecture:

In general, FPGAs consist of regular arrays of programmable logic blocks (PLBs)

connected to each other by a programmable routing matrix. An FPGA configuration

defines the functionality of an FPGA, specifying which logic blocks are used and which

wire segments are used to connect them, as well as what functionality each block

provides. As in Figure 1.

Figure 1: Figure 1: Architecture of the programmable logic core in FPGAs.

In this model, the programmable logic core of an FPGA consists of an array of three

basic elements: Configurable Logic Blocks (CLBs), Connection Boxes (CBs), and

Switch Boxes (SBs).In SRAM-based FPGAs, a CLB contains several SRAM lookup tables

(LUTs) to store user-defined logic functions. A CLB also contains flip-flops,

multiplexers, and dedicated circuitry to optimize the performance of user

applications. CLBs are connected through horizontal and vertical wiring channels

between two neighboring rows or columns. To enhance the connectivity, there are

various kinds of wires with different lengths for connecting CLBs that are different

blocks apart. For example, in Xilinx Virtex-series FPGAs as in Figure 2, single lines

connect adjacent CLBs, while hex lines connect CLBs that are three or six blocks apart

There are two types of routing devices, CBs and SBs, to direct the signal flows among

CLBs and wiring channels. CBs serve as a local bridge between CLBs and the adjacent

wiring channels. SBs are switch matrices that connect horizontal and vertical wiring

Page 5: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 4 of 22

channels. In SRAM-based FPGAs, the state of connections in these routing devices is

controlled by SRAM cells, which are configured according to the desired functionality.

Figure 2: Configuration frame architecture in Xilinx Virtex-series FPGAs.

Fault Tolerance

A fault can be defined as a physical occurrence within an FPGA that causes it

to malfunction, such as a broken wire caused during manufacture by a dust particle.

Faults usually occur at the beginning and end of a chip’s life cycle. Fabrication faults,

or defects, are usually caused by contaminants or other flaws in the manufacturing

process and are detected during manufacture testing. Late life faults are usually due

to failure of device resources. However, faults can and do appear at any point during

the life cycle of an FPGA. Faults may also be transient (impermanent). Transient faults

are more difficult to detect because they do not constantly manifest themselves. As

seen in Figure 3, the failure rate after the chip has left the factory and before the end

of its life is typically constant and is due to environmental stresses. The time axis may

be compressed, possibly significantly, if the device is used in a harsh environment.

Page 6: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 5 of 22

Figure 3: System failure rate during the life cycle of an FPGA.

FPGA fault tolerance methods can be divided into two groups, based on the

level of abstraction. The first group attempts to deal with faults at the level of the

FPGA hardware. These methods, which will be termed device-level (DL), generally

attempt to construct a fault-free array from a larger array containing faulty resources.

When faults are discovered, permanent routing or logic changes are made within the

FPGA, selecting redundant hardware resources to replace the faulty ones. Because

alterations must be made at the hardware level, DL methods are generally only used

for yield enhancement.

The second group of methods takes a higher-level approach, tolerating

faults at the level of the FPGA configuration. These configuration-level (CL) methods

treat the FPGA as a set of abstract resources, often represented as a graph structure,

without considering the actual physical structure of the device. When a circuit is

placed-and-routed, fault-free resources are selected from the set of available

resources. The status of resources as faulty vs. fault-free is considered each time a

circuit is placed-and-routed, so CL methods are able to tolerate new faults in the field.

Obviously CL methods are not transparent to the tools, so tolerance of new faults

requires additional configuration time. Both classes of fault tolerance have strengths

and weaknesses, and not all approaches fit neatly into one of these categories.

Deciding which method is best for a given purpose depends on several metrics,

including speed, power consumption, size, and cost. Generally, DL methods are used

by manufacturers, while CL methods are used by end-users.

Page 7: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 6 of 22

Application of Fault Tolerance

All of the fault detection and repair methods surveyed have individual strengths and

weaknesses and which method is most appropriate depends on the application. In

some cases reliability is critical for safety or mission success. Fast detection is crucial

here so that erroneous data or state is not acted upon.

Another class of reliable system is one where a physical repair is not practical. Fault

coverage and repair flexibility are important here. In the light of variability and

reliability concerns associated with future VLSI process nodes, it may become

economical to use fault tolerance in general purpose, high volume applications. In this

case, it will be important that the detection and repair method has the lowest possible

overhead on timing performance and area.

METHODS OF FAULT DETECTION

The first step of a fault-tolerant scheme is fault detection. Fault detection has two purposes;

firstly, it alerts the supervising process that action needs to be taken for the system to remain

operational and, secondly, it identifies which components of the device are defective so that

a solution can be determined. These two functions may be covered simultaneously, or it may

be a multi-stage process comprising of different strategies. Fault detection methods can be

categorised into three broad types:

1. Redundant/concurrent error detection uses additional logic as a means of

detecting when a logic function is not generating the correct output.

2. Off-line test methods cover any testing which is carried out when the FPGA is not

performing its operational function.

3. Roving test methods perform a progressive scan of the FPGA structure by swapping

blocks of functionality with a block carrying out a test function.

Page 8: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 7 of 22

Functional redundancy and Concurrent Error Detection (CED)

Redundancy is widely used as a method of fault detection in FPGAs, particularly in the

form of Triple Modular Redundancy (TMR). The main driver for error detection of

this kind is the need to detect and correct errors due to SEUs and SETs. However,

these methods will detect any error which occurs whilst the system is operating.

These detection methods work by invoking more than the minimum amount of logic

than is necessary to implement the logic function. When an error occurs, there is a

disagreement between the multiple parts of the circuit over which a particular

calculation is processed and this is flagged by some form of error detection function.

The simplest form is modular redundancy. A functional block is replicated, usually

two or three times and the outputs compared. Any difference indicates that a fault is

present. Concurrent Error Detection (CED) allows a more space efficient design than

modular redundancy. Data flows and stores are widened using error coding

algorithms such as parity. Data validation circuitry at the output to functional blocks

can then detect faults which arise. Redundancy provides a very fast means of error

detection, as a fault is uncovered as soon as a discrepancy occurs. In addition, this

form of error detection has a small impact on timing performance; just the latency of

voting or parity logic, or similar. The chief drawback of error detection using

redundancy is the area overhead needed to replicate functionality, which can be over

three times in the case of TMR .Furthermore it provides a limited resolution for

identification of the faulty component. The fault can only be pinned down to a

particular functional block or, in the case of TMR, an instance of a functional block.

Fault resolution can be increased to a certain extent by breaking functional areas

down and adding additional error detection logic. Modular redundancy detects all

faults which become manifest at the output of a functional block, including transient

errors. In CED, coverage comes at the expense of additional area. These methods

provide no coverage of dormant faults, which may be relevant if an FPGA is going to

be reconfigured in the field, either for fault repair or to alter the functionality.

Redundancy does not have to be restricted to the circuit-area dimension. It is also

possible to detect errors in a trade-off with latency/data throughput. Proposes a

Page 9: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 8 of 22

scheme where operations are carried out twice. In the second operation, operands

are encoded in such a way that they exercise the logic in a different way. The output

is then passed through a suitable decoder and compared to the original. Although

most of the work on redundancy has been aimed at detecting and correcting SEUs,

there have been some notable publications which apply the techniques tofault

detection. Parity checking is used as part of a fault tolerant scheme which is

structured so that detection is applied to small regular networks, rather than being

bespoke to the function that is implemented. Redundant and data-checking detection

systems are generally designed into an FPGA configuration, as they fit around the

specific data and control functions that are implemented.

Off-line fault detection / Built-In Self-Test

Off-line fault detection is another widely-used technique; usually as a means of

quickly identifying manufacturing defects in FPGAs. Any scheme which does this

without any external test equipment is known as Built-In Self-Test (BIST), and is a

suitable candidate for fault detection in the field. BIST schemes for FPGAs work by

having one or more test configurations which are loaded separately to the operating

configuration. Within the test configuration is a test pattern generator, an output

response analyser and, between them, the logic and interconnect to be tested,

arranged in paths-under-test. To be fully comprehensive, a BIST system will have to

test not only the logic and interconnect, but also the configuration network. Many

recent FPGAs feature a self-configuration port which can make this possible without

the need for a large number of different test configurations, which can take a long

time to load. Specialised features such as carry chains, multipliers and PLLs also need

to be considered. Compared to traditional built-in and external test methods for ASICs,

FPGAs have the advantage of a regular structure which does not need a new test

programme to be developed for each design. Also, the ability to reconfigure an FPGA

reduces or removes the need for dedicated test structures to be built into the silicon.

However, with the ability to reconfigure comes a vast number of permutations in the

way the logic can be expressed, making optimization of test patterns important. The

advantage of BIST as a fault detection method is that it has no impact on the FPGA

Page 10: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 9 of 22

during normal operation. The only overhead is the requirement to store test

configurations which are typically small. BIST also allows complete coverage of the

FPGA fabric, including features which may be hard to test with an on-line test system,

such as PLLs and the clock network. The major drawback of BIST is that it can only

detect faults during a dedicated test mode when the FPGA is not otherwise

operational. This can be either during system start-up, as part of a maintenance

schedule or in response to an error detected by some other means. Published BIST

methods have competed for coverage, test duration and memory overhead. Many

focus on testing just one subset of FPGA structures, e.g. interconnect, suggesting a

multi-phased approach for testing the whole chip. Testing of LUTs is a mature field

and recent developments have considered timing performance as well as stuck-at

faults. Many publications have focussed on testing interconnect in esponse to the

large amount of configuration logic and silicon area it consumes .In a BIST system for

interconnect is given which reduces test time through a large degree of self

configuration. Elements of BIST can be found in roving test systems, where only a

small part of the FPGA is taken off-line for testing at any point in time.

Roving fault detection

Roving detection exploits run-time reconfiguration to carry out BIST techniques on-

line, in the field, with a minimum of area overhead. In roving detection, the FPGA is

split into equal-sized regions. One of these is configured to perform self-test, while

the remaining areas carry out the design function of the FPGA. Over time, the test

region is swapped with functional regions one at a time so that the entire array can

be tested while the FPGA remains functional. Roving test has a lower area overhead

than redundancy methods; the overhead comprising of one self-test region and a

controller to manage the reconfiguration process. The method also gives excellent

fault coverage and granularity, comparable to BIST methods. The speed of detection,

while better than off-line methods that cannot test during normal operation, is not as

good as redundancy techniques. The detection speed depends on the period of a

complete roving cycle; the best reported implementations of roving test have

Page 11: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 10 of 22

maximum detection latency in the order of a second . Roving test impacts

performance in two ways. Firstly, as the test region is moved through the FPGA,

connections between adjacent functional areas are stretched. This results in longer

signal delays and may force a reduction in the system clock speed, reported to be in

the range of 2.5- 15% .Secondly, implementations in current FPGAs require the

functional blocks to be halted as they are switched.

Methods of Fault Tolerance

Single Fault Tolerance:

These are mostly transient faults caused by extraordinary circumstances in the

environment. Examples of these are charged particles striking the FPGA while it is in

space, or radioactive materials sending out energy which lodges inside the FPGAs

vulnerable systems. Because SEU faults are so common a plethora of methods have

been devised to mitigate them. Described here are some of the more noteworthy

efforts in this area.

Researchers in Milan used Triple Modular Redundancy and partial dynamic

reconfiguration to mitigate SEU faults in FPGAs. Their TMR was the simple

passive model, and they showed that this successfully mitigates soft error

effects. They also evaluated different levels of granularity. This led to an

examination of the tradeoffs between cost, performance, and recovery time.

A group at the Federal University of Rio Grande in Brazil created a fault

tolerant system for use in space, which as stated has many environmental

hazards that must be dealt with. Their system was implemented on a Xilinx

Virtex FPGA, but could be extended to other brands. They determined that

transient faults can cause a permanent effect because of the permanent nature

of SRAM-based FPGA cells. Their method used Duplication with Comparison

and Concurrent Error Detection based on time redundancy. They chose this

way because it worked well without the power and pin overhead caused by a

Page 12: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 11 of 22

traditionally TMR system. In addition this reduces the necessary hardware by

replacing some hardware redundancy with time redundancy

At the University of Florence a group devised a fault tolerant architecture that

avoids the effects of Single Event Upset in avionics applications. They showed

that the error rate of a device increases rapidly as the altitude of the FPGA

increases. They also found that SRAM FPGAs are used much more in avionics

applications than other types of FPGA devices. Further they found that the

configuration memory has the worst upset rate out of the pieces that make up

the critical components of the Integrated Control Panel for military aircraft

cockpits. In order to avoid these catastrophic single event upsets they used a

form of information redundancy that fixed the problems while also saving on

the cost and usage of FPGA resources. They found that for this particular

application and under these circumstances that a combination of a SEC-DED

code and a parity bit provided the best solution. The SEC-DED code protects

registers from upset, and the parity bit shields signals against transient faults.

Multiple Fault Tolerance :

While single faults account for many of the problems that FPGA systems encounter there are

some environments or some applications where multiple faults can happen simultaneously.

Some of these situations are due to the fact that feature size continues to decrease. This by

itself can cause many faults in a manufactured device.

One such method utilizes unused resources present in the FPGA to place repair

circuits for in-field FPGA repair. These unused portions are usually caused by

manufacturing defects. Normal place and route algorithms ignore them because they

do not provide the necessary complete functionality. Despite this researchers at

Brigham Young University identified times and areas where the unused resources can

be taken into account, and even utilized to create fault tolerance. Their method

identified repair circuits that could be used when necessary before all possible

permanent faults occurred. This method fixes faults in programmable logic, fixed

Page 13: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 12 of 22

functions such as DSP and BRAM, and routing resources. These possible alternate

repair circuits when either cached locally or are made available remotely can create

a high degree of fault tolerance in the target device. One disadvantage though is that

an incredibly large amount of computation must be done before hand in order to

create these alternate repair configurations. In their paper they show how this all

works on a Xilinx Virtex 4 FPGA, and outline three possible placement algorithms for

these repair circuits

Years ago researchers published an important paper on efficiently supporting fault

tolerance in FPGAs. Their method used a set of tiles to provide the desired level of

fault tolerance. They investigated the physical resources, interface, and net-list that

each tile would require. Further they found that having multiple configurations for

each tile would give increased reliability under various operating circumstances.

They showed that partially reconfiguring a FPGA to an alternate configuration in

response to a fault mitigated many different kinds of faults. This led to many other

great discoveries in this area that other researchers have since been able to capitalize

on. This kind of fault tolerance was hierarchical, with both tiles and atomic fault-

tolerant blocks.

A group, at the University of Florida, found that single-event upset hardened FPGAs

have reconfiguration limitations and small capacities, and as such are only suitable

for a limited range of applications. To combat this they devised a reconfigurable fault

tolerance framework. This framework enabled their system to dynamically self-adapt

to whatever the current upset rate happened to be. Further their method has two

modes, one for high reliability and one for high performance. Their system has the

ability to decide which mode to operate in, depending on the current upset rate.

Because many hazardous environments, such as space, have a high degree of

variability, their system can make smart decisions based on the current

circumstances. This self-adaptation allows their system to operate in many different

environments, as well as operate most efficiently in a given environment. When there

Page 14: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 13 of 22

are many faults entering the system from external sources it can run in a highly

reliable fashion. Once the cause of these faults settles the system can then switch to

high performance mode and get a lot of work done while it is able.

Hardware Level Fault Tolerance:

Hardware Level tolerance: performs a correction such that the FPGA remains

unchanged for the purposes of the configuration. The device retains its original

number and arrangement of useable logic clusters and interconnects. The regular

structure of FPGAs makes them suitable candidates for hardware level repair, using

methods similar to those used for defect tolerance in memory devices. In the event of

a fault, a part of the circuit can be mapped to another part with no change in function.

Hardware level repair has the advantage of being transparent to the configuration.

This makes repair a simple process, as the repair controller does not need any

knowledge of the placement and routing of the design. Another benefit is that the

timing performance of the repaired FPGA can be guaranteed, as any faulty element

will be replaced by a pre-determined alternative. Hardware level fault tolerance has

a drawback in that it can tolerate just a low number of faults for a given overhead and

there are likely to be certain patterns of faults which cannot be tolerated. The most

common basis for hardware level repair is column/row shifting . Multiplexers are

introduced at the ends of lines of cells, allowing a whole row or column to be bypassed

by shifting across to a spare line of cells at the end of the array. If the FPGA is bus-

based, the shifted cells can connect to the same lines of interconnect. For segmented

interconnect, bypass sections need to be added to skip the faulty row / column. This

method can be found today in some commercial devices Adding more bypass

connections and multiplexers allows greater flexibility for tolerating multiple faults

and makes more efficient use of spare resources .

Page 15: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 14 of 22

Configuration Level fault tolerance:

Configuration level repair exploits two key features in FPGAs; reconfiguration and the

availability of unused resources. Configuration level repair strategies can be divided

into three subclasses:

Alternative configurations:

A straightforward way of achieving fault tolerance is to pre-compile alternative

configurations. The FPGA is split into tiles, each with its own set of configurations

which have a common functionality and . Fault tolerance is achieved by replacing a

configuration tile with an alternative in which the faulty resource is not used. This

method requires little run-time computation as the placement and routing is already

done. This strategy performs relatively poorly in terms of area efficiency and fault

pattern. It is dependent on there being a configuration available in which any given

resource is set aside as a spare. If only a small amount of spare resource is available

then a large number of configurations are needed to cover all possible faults.

Incremental mapping, placement and routing:

A simple method of tolerating faults in logic clusters exists if the cluster can be

reconfigured to work , Cluster reconfiguration is simple to evaluate and does not have

a significant effect on design timing, but there may be cases where reconfiguration is

not possible, for example if an output register is faulty. If there are spare clusters, then

these can be used to replace faulty ones. Cluster reconfiguration is carried out in

preference and faulty clusters can be reused by a different function if the fault will not

be manifest. It is also important to consider faults in interconnect. Incremental

mapping, placement and routing provide a high degree of flexibility for dealing with

random fault patterns, especially when cluster reconfiguration and pebble shifting

are used together. However, this comes at the cost of increased computational effort

for the repair which must be carried out in the field. Spare clusters and interconnect

are allocated at design-time in order to reduce the complexity of a repair in the field

and to provide guaranteed timing performance. This technique, however, reduces

Page 16: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 15 of 22

flexibility and overhead efficiency. In [18] a more coarse-grained approach is taken.

Clusters are arranged into networks such that most faults can be repaired by

reconfiguring the network and leaving the wider placement unchanged.

Evolutionary algorithms:

Reconfiguration makes FPGAs well suited to evolutionary algorithms. Configurations

compete for correctness and those which are faulty are mutated. Although this

approach allows a large degree of flexibility with the number and distribution of

faults that can be tolerated, the area and computational overhead required is very

large.There is also no guarantee of how long a solution will take to evolve or what its

timing performance will be.

System level Fault Tolerance :

works at a higher level. When adesign is highly modular, a fault can be tolerated by the use

of a spare functional block or by providing degraded performance . Such methods are not

considered in more detail here, as they are not limited in application to FPGAs.

Page 17: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 16 of 22

Conclusion:

FPGAs are a very important computing resource for many different fields in the world today.

Their reconfigurability allows for incredible flexibility and reuse. But this benefit comes with

a cost. FPGAs are used in applications which demand a high amount of reliability and

availability. This is why many people over the years have researched the effects of faults

within FPGA systems, and different methods of detecting and diagnosing said faults. Once

the faults have been identified there are myriad ways to provide for fault tolerance.

Clearly, no single FT methodology is significantly better than the others, The best general

solution to FPGA FT is probably a combination of both DL and CL fault tolerance

methodologies. The most likely future advancement in fault tolerance will be in the area of

self-adaptation in the presence of faults. This will allow FPGAs to be fault tolerant no matter

the environment. As for detection and diagnosis the focus is always on improving speed,

coverage and overhead.

Appendix: Prototype for Fault Correction in

FPGA based systems.

Problem Statement: Faulting Correcting in a 8 Bit Kogge-Stone Adder

Solution: This is done by correcting the fault in each of the individual units of the

Kogge-stone adder.

Lets start with defining a

Kogge-Stone adder: adder, which is classified as a parallel prefix adder, has a critical

path on the order of (where is the width of the adder in bits). The regularity of its

structure makes it suitable for VLSI designs as well as FPGA implementations

The Kogge–Stone adder is a parallel prefix form carry look-ahead adder. It generates the carry

signals in O(log n) time, and is widely considered the fastest adder design possible. It is the

common design for high-performance adders in industry

Page 18: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 17 of 22

An example of a 4-bit Kogge–Stone adder is shown below. Each vertical stage produces a

"propagate" and a "generate" bit, as shown. The culminating generate bits (the carries) are

produced in the last stage (vertically), and these bits are XOR'd with the initial propagate after

the input (the red boxes) to produce the sum bits. E.g., the first (least-significant) sum bit is

calculated by XORing the propagate in the farthest-right red box (a "1") with the carry-in (a "0"),

producing a "1". The second bit is calculated by XORing the propagate in second box from the

right (a "0") with C0 (a "0"), producing a "0".

Example of a 4-bit Kogge–Stone adder with zero carry-in.

The Kogge-Stone adder is classified as a parallel prefix adder since the generateand

the propagate signals are precomputed. In a tree-based adder, carries are generated

in tree and fast computation is obtained at the expense of increased area and power.

The main advantage of this design is that the carry tree reduces the logic depth of

the adder by essentially generating the carries in parallel. The parallel-prefix adder

becomes more favorable in terms of speed due to the O(log2n) delay through the

carry path compared to O(n) for the RCA. The Kogge-Stone adder is widely used in

high-performance 32-bit, 8

64-bit, and 128-bit adders as it reduces the critical path to a great extent compared

to the ripple carry adder.

This operator works on the generate and propagate pairs as defined by,

Page 19: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 18 of 22

(gL, pL) (gR, pR) = (gL + pL gR, pL pR)

where gL, pL are the left input generate and propagate pairs and gR, pR are the right

input generate and propagate pairs to the cell. For example, in a 4-bit carry

lookahead adder, the carry combination equation can be expressed as,

c4 = (g4, p4) [ (g3, p3) [(g2, p2) (g1, p1)] ] (2.6)

= (g4, p4) [ (g3, p3) [(g2 + p2 g1, p2 p1)] ]

:

:

= g4 + p4 g3 + p4 p3 g2 + p4 p3 p2 g1

Since the fco obeys the associativity property, the expression can be reordered to

yield parallel computations in a tree based structure [7],

c4 = [(g4, p4) (g3, p3)] [(g2, p2 ) (g1, p1)] (2.7)

VHDL CODE:

VHDL Code for GeneratePropogate block correction in 8-bit Kogge-

Stone Fault Correcting Adder

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use IEEE.STD_LOGIC_ARITH.ALL;

use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity GPblock is

port( a,b: in std_logic;

g,p: out std_logic);

end GPblock;

architecture Behavioral of GPblock is

Page 20: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 19 of 22

begin

g <= a and b;

p <= a xor b;

Behavioral;

VHDL Code for mux correction in 8-bit Kogge-Stone Fault Correcting

Adder

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

entity mux is

port (x,y: in std_logic_vector(1 downto 0);

z : out std_logic_vector(1 downto 0);

sel: in std_logic);

end mux;

architecture Behavioral of mux is

constant delay: time :=100ns;

begin

mux_proc : process(x,y,sel)

variable temp : std_logic_vector(1 downto 0);

begin

Page 21: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 20 of 22

case sel is

when '0'=> temp:=x;

when '1'=> temp:=y;

when others => temp :="XX";

end case;

z<= temp;

end process mux_proc;

end Behavioral;

VHDL Code for sum correction in 8-bit Kogge-Stone Fault Correcting

Adder

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use IEEE.STD_LOGIC_ARITH.ALL;

use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity sum is

port( p,c: in std_logic;

s : out std_logic);

end sum;

architecture Behavioral of sum is

begin

s <= (p xor c);

end Behavioral;

Page 22: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 21 of 22

Delay calculations for 8bit Kogge stone adder

References:

List of References:

1. Jason. A. Cheatham , John M. Emmert and Stan Baumgart 2006. A Survey of Fault Tolerant

Methodologies for FPGAs

2. Jano Gebelein, Heiko Engel and Udo Kebschull 2010. FPGA fault tolerance in radiation

susceptible environments

3. Khaled Elshafey, Jan Hlavicka ˇ 2002 . FAULT-TOLERANT FPGA-BASED SYSTEMS

4. Daniel Fisher, Addison Floyd . Survey of Detection, Diagnosis, and Fault Tolerance

Methods in FPGAs

5. Wei-Je Huang and Edward J. McCluskey . Column-Based Precompiled Configuration

Techniques for FPGA Fault Tolerance

6. Edward Stott, Pete Sedcole, Peter Y. K. Cheung 2007. FAULT TOLERANT METHODS FOR

Page 23: Fault Tolerance in an FPGA System

FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh

Page 22 of 22

RELIABILITY IN FPGAs

BasicsFPGA :https://www.google.com/search?biw=1536&bih=758&q=fpga+basics&revid=

1721831489&sa=X&ei=LEw0VPK7LIuxyAT6loLICg&ved=0CG8Q1QIoBQ

7. Technical Blogs:

http://www.pe-ip.com/ by Marc Perron http://billauer.co.il/blog/category/fpga/

by Eli Billauerhttp://fpgablog.com/

8. http://www.dtic.mil/dtic/tr/fulltext/u2/a462520.pdf

9. http://www.ijetch.org/papers/424-C049.pdf

10.http://www.academia.edu/5178678/High_Speed_Fault_Injection_Tool_Implemented_Wi

th_Verilog_HDL_on_FPGA_for_Testing_Fault_Tolerance_Designs

Others:

Google

Wikipedia

A text book for Hardware description language Verilog and VHDL by A K pedroni