A Novel Fault Tolerant Design and an Algorithm for Tolerating Faults in Digital Circuits

R. V. Kshirsagar (1), R. M. Patrikar (2)

(1) Priyadarshini College of Engg. & Arch., Nagpur 440019, India
(2) Visvesvaraya National Institute of Tech., Nagpur 440022, India
e-mail: [email protected], [email protected]

[IEEE 2008 3rd International Design and Test Workshop (IDT) - Monastir, Tunisia (2008.12.20-2008.12.22)] 2008 3rd International Design and Test Workshop - A novel fault tolerant design


Abstract - This paper proposes a novel fault tolerant algorithm for tolerating stuck-at faults in digital circuits. We consider single stuck-at faults occurring either at a gate input or at a gate output. A stuck-at fault may adversely affect the functionality of the user-implemented design. A novel fault tolerant design based on hardware redundancy (replication) is presented for the single fault model to tolerate transient as well as permanent faults. The design is also suitable for highly dependable systems implemented on Field Programmable Gate Arrays (FPGAs) at the RTL level. This approach offers the possibility of using larger and more cost-effective devices that contain interconnect defects without compromising performance or configurability. The algorithm demonstrates the fault tolerance capability of the design; it is implemented for a full adder circuit but can be generalized to any other digital circuit. The functioning of all three full adders can be verified using exhaustive testing. If a stuck-at fault occurs, the circuit configures itself to select the other, fault-free outputs. We have evaluated our novel fault tolerant technique (NFT) on five different circuits: full adder, encoder, counter, shift register and microprocessor. The proposed design approach also scales well to larger digital circuits and does not require fault detection. We also present the results of the triple modular redundancy (TMR) method and compare them with our technique. All possible faults are tested by injecting them using a multiplexer.

Index Terms - Fault tolerance, fault injection, field programmable gate arrays (FPGA), reconfiguration, triple modular redundancy (TMR), novel fault tolerant technique (NFT).

I. INTRODUCTION

Reliability and performance are two important factors that are becoming major concerns for next-generation very deep sub-micron systems. Their reduced supply voltages, and therefore reduced noise margins, together with their reduced internal capacitances, will dramatically increase their susceptibility and sensitivity to radiation and to noise in general, making system failures extremely likely [1], [2]. As a consequence, not only will systems oriented to mission-critical applications (e.g., space, avionics, transport) reinforce their use of fault tolerance, but general-purpose systems implemented in next-generation very deep sub-micron technologies, including FPGAs, will also require some form of fault tolerance [4], [5].

There are two fundamentally different approaches that can be taken to increase the reliability of computing systems. The first approach is called fault prevention (also known as fault intolerance) and the second, fault tolerance. In the traditional fault prevention approach, the objective is to increase reliability by a priori elimination of all faults. Since this is almost impossible to achieve in practice, the goal of fault prevention is to reduce the probability of a system failure to an acceptably low value. In the fault tolerance approach, faults are expected to occur during computation, but their effects are automatically countered by incorporating redundancy, that is, additional resources, so that valid computation can continue even in the presence of faults. These resources may include additional hardware (hardware redundancy), redundant information (information redundancy), additional software (software redundancy), more time (time redundancy), or a combination of these. They are redundant in the sense that they can be omitted from a system without affecting its normal operation.

Most of the early work in fault-tolerant system design was motivated by aerospace applications, and in particular by the requirement for computers to operate unattended for long periods of time. While this application is still an important one, fault tolerance is now regarded as a desirable, and in some cases essential, feature of a wide range of computing systems, especially in applications where reliability, availability, and safety are of vital importance. For commercial systems, nonredundant (i.e. fault prevention) techniques have been preferred, mainly because a redundant design results in increased overhead in terms of area, power consumption and the like. Reliability is improved by using reliable components, refined interconnections, and so on. However, this approach has limited effectiveness in counteracting hardware faults and in serving applications in which system failures once per day or once per week are not acceptable. A fault-tolerant design can provide dramatic improvements in system availability and lead to a substantial reduction in maintenance costs as a consequence of fewer system failures.

To tolerate permanent faults in system hardware, redundancy is the most commonly used approach, though it results in increased overhead in terms of area, power consumption and the like. The most common form of modular redundancy in practical systems is Triple Modular Redundancy (TMR), used for single event upset (SEU) mitigation [6]. The basic concept of triple redundancy is that a sensitive circuit can be hardened against SEUs by implementing three copies of the same circuit and performing a bit-wise ‘majority vote’ on the outputs of the triplicated circuit, as shown in Fig. 1.

The circuit in question can be a mere flip-flop or an entire logic design. The function of the majority voter is to output the logic value (‘1’ or ‘0’) that corresponds to at least two of its inputs. For example, if two or more of the voter’s three inputs are ‘1’, then the output of the voter is ‘1’. If the inputs of the voter are labeled A, B, and C, and the output V, then the Boolean equation for the voter is: V = AB + AC + BC.
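The voter equation can be checked with a short behavioural sketch. The Python below is an illustration only and is not the authors' VHDL; the names `majority` and `tmr` are hypothetical, and the single output stuck-at fault is modelled simply by overwriting one copy's output.

```python
from itertools import product


def majority(a: int, b: int, c: int) -> int:
    """Majority voter, V = AB + AC + BC, on single-bit inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(circuit, inputs, faulty_copy=None, stuck_value=0):
    """Evaluate three copies of a single-output circuit and vote.

    If faulty_copy is 0, 1 or 2, that copy's output is forced to
    stuck_value (a behavioural stand-in for an output stuck-at fault).
    """
    outs = [circuit(*inputs) for _ in range(3)]
    if faulty_copy is not None:
        outs[faulty_copy] = stuck_value
    return majority(*outs)


# Enumerate all eight input combinations of the voter as (A, B, C, V).
truth_table = [(a, b, c, majority(a, b, c))
               for a, b, c in product((0, 1), repeat=3)]
```

Calling `tmr(lambda a, b: a ^ b, (1, 0), faulty_copy=0, stuck_value=0)` still yields the fault-free value, because the two healthy copies outvote the faulty one. The sketch does not model faults inside the voter itself.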

Fig. 1. Triple modular redundancy with voter (the inputs drive the original circuit and two duplicated circuits; their outputs feed the voter, which produces the output)

The truth table for the majority vote circuit is shown in Table 1.

TABLE 1

MAJORITY VOTE TRUTH-TABLE

A  B  C  V
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  1

The logic gate representation of the majority voter is shown in Fig. 2.

Fig. 2. Majority voter circuit (inputs A, B, C; output V)

Testing is unlikely to detect the presence of short-lived transient faults [2]. According to field studies, such non-permanent faults are the dominant cause of very large scale integration (VLSI) circuit and system failures (82-98%) [3]. Modeling intermittent and transient faults requires statistical data on their probability of occurrence, which are usually not available.

The algorithm and design presented here deal not only with permanent stuck-at faults but also with transient faults in digital circuits [7]. The typical faults affecting interconnections are breaks, known as opens, and unwanted connections between points, known as shorts. A short between a signal line and ground or power can make the signal remain at a fixed voltage level. Such a fault is logically modeled as the signal being stuck at the corresponding fixed logic value v (0 or 1) and is denoted s-a-v, i.e., the line always has the same logic value v, regardless of the inputs that would normally affect it.

We have designed a circuit (NFT), shown in Fig. 3 and based on the algorithm presented here, which replaces the voter circuit of the TMR method. The drawback of the TMR circuit is that the voter circuit is not 100% reliable: a fault in the voter itself can corrupt the output. By contrast, our design tolerated 100% of the injected faults in the circuits we evaluated (Table 4).


We also present a comparison between the TMR technique and our technique (NFT) in terms of area overhead and performance. This circuit is also capable of tolerating bridging faults, i.e., shorts between two signal lines.

II. ALGORITHM

The algorithm presented in this paper can be generalized for any digital circuit. The circuit under test is denoted CUT. It is assumed that the design has n primary inputs, m primary outputs, and a list of combinational circuits. We use the following notation for the main signals, components and circuits:

• The primary inputs are denoted as (i1, i2, i3, ..., in)
• The primary outputs are denoted as (y1, y2, y3, ..., ym)
• The intermediate inputs are denoted as (is1, is2, ..., isa)
• The intermediate outputs are denoted as (os1, os2, ..., osb)
• CUT – Circuit under test
• EOR – Ex-or gate
• PE – Priority encoder
• MUX – Multiplexer

 1. Algorithm single-stuck-at-fault
 2. Inputs: (i1, i2, i3, ..., in) and CUT
 3. Outputs: (y1, y2, y3, ..., ym)
 4. begin
 5.   for CUT do
 6.     create two more copies of CUT and 2*m [EORs]
 7.     connect the similar outputs of CUT and its copies to the inputs of EORs
 8.   end for
 9.   for all EOR do
10.     create m copies of PE
11.     connect similar outputs of EORs to the inputs of PEs
12.   end for
13.   for all PE do
14.     create m copies of MUX
15.     connect the outputs of PEs to select lines of MUXs
16.     connect the similar outputs of CUT and its copies to the input lines of MUXs
17.   end for
18.   for all inputs i1 to in do
19.     check the outputs y1 to ym at MUXs
20.   end for
21. end algorithm
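The structure produced by the algorithm can be sketched behaviourally for a CUT with m outputs. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the pairing of the ex-or gates (copy 1 with copy 2, and copy 2 with copy 3), the default encoder output when both comparisons disagree, and the names `priority_encoder`, `mux2` and `nft_select` are all hypothetical.

```python
def priority_encoder(i1: int, i2: int) -> int:
    """Select-line value for one output bit; priority is given to i1."""
    if i1 == 0:
        return 0  # copies 1 and 2 agree: select copy 1
    if i2 == 0:
        return 1  # copies 2 and 3 agree: select copy 3
    # Assumption: both comparisons disagree, so under the single-fault
    # model copy 2 is the faulty one and copy 1 is still safe to select.
    return 0


def mux2(sel: int, d0: int, d1: int) -> int:
    """2:1 multiplexer."""
    return d1 if sel else d0


def nft_select(y1, y2, y3):
    """Combine the m outputs of the CUT (y1) and its two copies (y2, y3)."""
    result = []
    for o1, o2, o3 in zip(y1, y2, y3):
        e1 = o1 ^ o2  # first EOR for this output (assumed pairing)
        e2 = o2 ^ o3  # second EOR for this output (assumed pairing)
        sel = priority_encoder(e1, e2)
        result.append(mux2(sel, o1, o3))
    return tuple(result)
```

For example, `nft_select((0, 1), (1, 1), (1, 1))` returns `(1, 1)`: the first output of copy 1 is wrong, so the multiplexer for that bit selects copy 3. This uses 2*m ex-or gates, m priority encoders and m multiplexers, matching the counts in the listing above.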

This algorithm can be used to generate a fault tolerant design for any digital circuit. The algorithm is presented for a single fault simulation model, which consists of the following steps:

1. Assign values to the primary inputs, i.e., (i1, i2, i3, ..., in),
2. Choose the interconnect or output of CUT randomly,
3. Perform logic simulation and record the monitored values,
4. Force inverse value (inject fault) on chosen interconnect,
5. Again perform the logic simulation and record the monitored values,
6. Compare the monitored values in case (3) and (5).
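The six steps above can be sketched for a gate-level full adder. This Python sketch is illustrative only: the net names (`x1`, `a1`, `a2`, `s`, `co`), the gate decomposition of the adder, and the helper names are assumptions and do not come from the paper's VHDL.

```python
import random

# Hypothetical internal and output nets of a gate-level full adder.
NETS = ("x1", "a1", "a2", "s", "co")


def full_adder_nets(a, b, ci, forced=None):
    """Simulate the adder; nets listed in `forced` are held at the given value."""
    forced = forced or {}
    nets = {}

    def drive(name, value):
        # A forced net ignores its driver, which models an injected fault.
        nets[name] = forced.get(name, value)
        return nets[name]

    x1 = drive("x1", a ^ b)
    a1 = drive("a1", a & b)
    a2 = drive("a2", x1 & ci)
    drive("s", x1 ^ ci)
    drive("co", a1 | a2)
    return nets


def inject_and_compare(a, b, ci, rng):
    """Steps 1-6 for one input vector; returns (chosen net, outputs differ?)."""
    net = rng.choice(NETS)                          # step 2: pick a net
    golden = full_adder_nets(a, b, ci)              # step 3: fault-free run
    forced = {net: golden[net] ^ 1}                 # step 4: force the inverse
    faulty = full_adder_nets(a, b, ci, forced)      # step 5: faulty run
    differs = (golden["s"], golden["co"]) != (faulty["s"], faulty["co"])
    return net, differs                             # step 6: comparison
```

A call such as `inject_and_compare(0, 0, 1, random.Random(0))` reports which net was chosen and whether the fault reached the primary outputs for that input vector.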

III. METHODOLOGY

This paper presents a novel technique to tolerate stuck-at faults at the outputs of any digital circuit under test (CUT). If there is a fault at one of the outputs, the circuit itself detects the fault and reconfigures to provide the fault-free output. In place of the voter circuit we use a novel circuit, shown in Fig. 3, consisting of ex-or gates, a priority encoder and a multiplexer, to produce a fault-free output at all times. This approach achieves fault tolerance with respect to all possible single stuck-at faults. The idea is implemented for a full adder circuit as shown in Fig. 4.

Fig. 3. Novel fault tolerant circuit (inputs A, B, C; EOR gates; priority encoder pe with inputs i1, i2 and output o; 2:1 mux; output V)


Fig. 4. Circuit to tolerate interconnect stuck-at faults

The circuit under test is triplicated. The similar outputs of the CUT and its copies are fed to the ex-or gates. The similar outputs of the ex-or gates are in turn fed to the priority encoders. The outputs of the encoders are fed to two different multiplexers as select lines. The inputs to these multiplexers are the similar outputs of two of the full adders, i.e., fa1 and fa3. We have used D flip-flops (d-ffs) at the outputs of the CUTs to synchronize the outputs. It is assumed that only one fault occurs at a time. All possible faults are tested exhaustively by injecting the faults (0, 1) at all the nodes of the CUT using a multiplexer, as shown in Fig. 5.

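Fig. 5. Logic node (a multiplexer that passes the data signal to the node or, under control of a fault enable signal, forces ‘1’ or ‘0’)

The multiplexer can be inserted at the output net of the circuit to be tested. This technique can be generalized and implemented for tolerating faults in any other circuit.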
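The fault-injection node can be modelled in a few lines. This is a hedged behavioural sketch of the multiplexer idea only; the function names and the argument `stuck_value` are hypothetical, and the real design is a hardware multiplexer rather than software.

```python
def logic_node(data: int, fault_enable: int, stuck_value: int) -> int:
    """Pass `data` through, or force `stuck_value` when fault_enable is 1."""
    return stuck_value if fault_enable else data


def single_fault_list(nodes):
    """Enumerate every single stuck-at fault (s-a-0 and s-a-1) for the nodes."""
    return [(node, value) for node in nodes for value in (0, 1)]
```

With `fault_enable` at 0 the node is transparent, so the same netlist can be used for fault-free and faulty runs. `single_fault_list` yields two faults per instrumented node, which is the list an exhaustive injection campaign would iterate over.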

IV. TESTING STRATEGY

A three-bit input is applied to the full adder and its copies. The outputs of the full adders (fa1, fa2 and fa3) are s1, s2, s3 and c1, c2, c3, i.e., sum and carry respectively. The system tests the outputs of the full adders; if any of them is stuck at a fault level, the circuit selects a fault-free output through the priority encoder, which is finally propagated to the output (sum_o / carry_o) through mux1 / mux2. The truth table of the full adder for exhaustive testing is shown in Table 2.

If the input (a b ci) is “001”, then outputs s1, s2 and s3 are all ‘1’. Since both inputs of each ex-or gate are the same, the ex-or outputs s4 and s5 are ‘0’; these are connected to the inputs of the priority encoder. The priority encoder is designed so that whenever i1/i3 is ‘0’ its output is assigned ‘0’, and whenever i2/i4 is ‘0’ its output is assigned ‘1’. The encoder outputs (o1/o2) drive the select lines of mux1/mux2; hence s1 (‘1’) / c1 (‘0’) is propagated to the output sum_o / carry_o through mux1/mux2, which is the correct (fault-free) output. Here priority is given to the i1/i3 inputs. Now assume that s1 is s-a-0. Then the output s4 of the ex-or gate becomes ‘1’ (as s1 = q1 = ‘0’ and s2 = s3 = ‘1’) and i2 becomes ‘0’; therefore o1 will be ‘1’, which is the select signal for mux1, and hence s3 (‘1’) is propagated to the output sum_o through mux1, giving the fault-free (correct) output.

Now let us consider the input “011”. In this case outputs s1, s2 and s3 will be ‘0’, the outputs s4 and s5 of the ex-or gates will also be ‘0’ and, as i1 is ‘0’, the select line of mux1 will be ‘0’, making the final sum output (sum_o) ‘0’. Now assume that s1 is s-a-1. Then the output s4 of the ex-or gate becomes ‘1’ (as s1 = q1 = ‘1’ and s2 = s3 = ‘0’) and i2 becomes ‘0’; therefore o1 will be ‘1’, which is the select signal for mux1, and hence s3 (‘0’) is propagated to the output sum_o through mux1, giving the fault-free (correct) output.
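The two walk-throughs above can be repeated for every input vector and every output stuck-at fault with a small behavioural model. The sketch below is an illustration under assumptions, not the authors' test bench: it ignores the D flip-flops, assumes the ex-or gates compare fa1 with fa2 and fa2 with fa3, injects faults only at the full adder outputs, and uses hypothetical names throughout.

```python
from itertools import product


def full_adder(a, b, ci):
    """Return (sum, carry) of a one-bit full adder."""
    return a ^ b ^ ci, (a & b) | (a & ci) | (b & ci)


def select(q1, q2, q3):
    """EORs, priority encoder and 2:1 mux for one output bit."""
    e1 = q1 ^ q2  # assumed EOR pairing: fa1 with fa2
    e2 = q2 ^ q3  # assumed EOR pairing: fa2 with fa3
    if e1 == 0:
        sel = 0   # priority to the first comparison: keep fa1
    elif e2 == 0:
        sel = 1   # fa2 and fa3 agree: switch to fa3
    else:
        sel = 0   # assumption: fa2 is the faulty copy, fa1 is still good
    return q3 if sel else q1


def run_exhaustive():
    """Inject every single output stuck-at fault for every input vector."""
    injected = tolerated = 0
    for a, b, ci in product((0, 1), repeat=3):
        expected = full_adder(a, b, ci)
        for copy, bit, stuck in product(range(3), range(2), (0, 1)):
            outs = [list(full_adder(a, b, ci)) for _ in range(3)]
            outs[copy][bit] = stuck  # force one output of one copy
            sum_o = select(outs[0][0], outs[1][0], outs[2][0])
            carry_o = select(outs[0][1], outs[1][1], outs[2][1])
            injected += 1
            tolerated += (sum_o, carry_o) == expected
    return injected, tolerated
```

In this model `select(0, 1, 1)` reproduces the s1 s-a-0 case for input “001” and `select(1, 0, 0)` the s1 s-a-1 case for input “011”. The fault count returned by `run_exhaustive` belongs to this simplified model and is not the fault list reported in Table 4.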

A similar testing and detection technique is applied using the remaining test vectors, i.e., “000” and “111”, and by inserting faults at c1, c2 or c3 to check the carry output carry_o at the output of mux2.

TABLE 2

TRUTH-TABLE (EXHAUSTIVE TESTING)

  Input      Output
a  b  ci     s  co
0  0  0      0  0
0  0  1      1  0
0  1  0      1  0
0  1  1      0  1
1  0  0      1  0
1  0  1      0  1
1  1  0      0  1
1  1  1      1  1

This circuit can also be tested for any stuck-at fault using pseudo-random testing, as shown in Table 3.

TABLE 3

TRUTH-TABLE (PSEUDO-RANDOM TESTING)

  Input      Output
a  b  ci     s  co
0  0  0      0  0
0  1  1      0  1
0  0  1      1  0
1  1  1      1  1

TABLE 4

FAULT COVERAGE USING NOVEL FAULT TOLERANT TECHNIQUE (NFT)

Sr.   Circuit                  No. of injected   No. of tolerated   Fault tolerance
No.                            faults            faults             (%)
1.    Full Adder               46                46                 100
2.    8 to 3 Encoder           96                96                 100
3.    3-Bit shift register     20                20                 100
4.    3-Bit ripple counter     42                42                 100
5.    8-Bit microprocessor     168               168                100

TABLE 5

COMPARISON OF THREE TECHNIQUES IN TERMS OF AREA AND PERFORMANCE FOR FULL ADDER CIRCUIT USING ISE 8.2i

Sr.   Fault tolerance   Delay    No. of     No. of   No. of      No. of LUTs   No. of   No. of   No. of   Total equiv.
No.   technique         (ns)     i/o pads   nets     instances   (F.G.)        CLBs     i/ps     o/ps     gate counts
1.    None               8.369   05         10       09          02            01       03       02       12
2.    T.M.R.            13.364   11         29       22          09            05       09       02       54
3.    N.F.T.             9.977   11         28       21          08            05       09       02       48


We have compared three implementations of the full adder circuit in terms of area, performance and power dissipation on Xilinx’s XCV50-6PQ240 FPGA: the normal CUT, TMR and our technique (NFT), for permanent as well as transient faults. Power dissipation in each circuit was evaluated using Xilinx’s XPower tool. Table 4 shows the results in terms of fault coverage for the five circuits mentioned earlier. Table 5 shows the results in terms of area and performance for the three implementations of the full adder circuit. Using our technique we have reduced the area relative to TMR (48 versus 54 total equivalent gates). In terms of performance, the normal full adder circuit without fault tolerance had a maximum delay of 8.369 ns, the TMR version had a delay of 13.364 ns and our NFT version had a delay of 9.977 ns, a 25.344% reduction in delay relative to TMR. The improvement in performance and power dissipation was greater for larger circuits.

Table 6 shows the comparison of the three implementations in terms of power dissipation. Total power dissipation using NFT was also lower than with the TMR method. This design is also capable of producing fault-free outputs even when stuck-at faults occur at the inputs of the priority encoders or multiplexers and at the outputs of the priority encoders.
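The quoted improvement is a relative reduction with TMR as the baseline. The short Python check below reproduces it from the Table 5 delays and applies the same formula to the gate counts of Table 5 and the power figures of Table 6; the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# Values taken from Table 5 (delay, equivalent gates) and Table 6 (power).
delay_gain = reduction_percent(13.364, 9.977)   # TMR vs. NFT delay in ns
gate_gain = reduction_percent(54, 48)           # TMR vs. NFT equivalent gates
power_gain = reduction_percent(135.73, 129.54)  # TMR vs. NFT power in mW

print(f"delay: {delay_gain:.3f}%  gates: {gate_gain:.1f}%  power: {power_gain:.2f}%")
```

The delay figure rounds to the 25.344% stated in the text; by the same measure the gate count is reduced by about 11.1% and the power by about 4.56% relative to TMR.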

TABLE 6

COMPARISON IN TERMS OF POWER DISSIPATION

V. SIMULATION AND SYNTHESIS

The fault-tolerance ability of the design was tested and verified by means of VHDL simulation. We wrote VHDL code for the top entity and the various components, and used ModelSim SE 6.2e to verify the design. The circuits were synthesized and tested for Xilinx’s XCV50-6PQ240 FPGA. Synthesis was done using the Xilinx synthesis tool (XST) of ISE Foundation series 8.2i.

VI. CONCLUSION

Many techniques have been suggested in the past to detect and tolerate interconnect stuck-at faults. The technique presented in this paper addresses stuck-at faults with a novel algorithm. The methodology can be generalized to other such circuits in a similar way. We have successfully implemented and tested this design on Xilinx’s XCV50-6PQ240 FPGA.

REFERENCES

[1] M. K. Stojcev, G. Lj. Djordjevic, T. R. Stankovic, “Implementation of self-checking two-level combinational logic on FPGA and CPLD circuits”, Microelectronics Reliability, issue 44, 2004, pp. 173-178.

[2] P. K. Lala, “Self-checking and fault-tolerant digital system design”, San Francisco: Morgan Kaufmann Publishers, 2001.

[3] X. Castilo et al., “Boundary-scan test: A practical approach”, Dordrecht: Kluwer Academic Publishers, 1993.

[4] Monica Alderighi, Sergio D’Angelo, Cecilia Metra, and Giacomo R. Sechi, “Novel Fault-Tolerant Adder Design for FPGA-Based Systems”, IEEE Proceedings on On-Line Testing Workshop, Volume 7, 2001, pp. 54-58.

[5] J. Lach, W. H. Mangione-Smith and M. Potkonjak, “Low Overhead Fault-Tolerant FPGA Systems”, IEEE Trans. on VLSI Systems, 6(2), pp. 212-221, June 1998.

[6] Sandi Habinc, “Functional Triple Modular Redundancy (FTMR)”, Design and Assessment Report, Gaisler Research, FPGA-003-01, ver. 0.2, pp. 1-55, December 2002.

[7] F. Hanchek and S. Dutt, “Methodologies for Tolerating Cell and Interconnect Faults in FPGAs”, IEEE Transactions on Computers, Vol. 47, pp. 15-33, January 1998.

TABLE 6

Sr.   Fault tolerance   Power dissipation
No.   technique         (mW)
1.    None              127.77
2.    T.M.R.            135.73
3.    N.F.T.            129.54