A Novel Fault Tolerant Design and an Algorithm for Tolerating Faults in Digital Circuits
R. V. Kshirsagar (1), R. M. Patrikar (2)
(1) Priyadarshini College of Engg. & Arch., Nagpur 440019, India
(2) Visvesvaraya National Institute of Tech., Nagpur 440022, India

Abstract - This paper proposes a novel fault tolerant algorithm for tolerating stuck-at faults in digital circuits. We consider single stuck-at faults occurring either at a gate input or at a gate output. A stuck-at fault may adversely affect the functionality of the user-implemented design. A novel fault tolerant design based on hardware redundancy (replication) is presented for the single fault model to tolerate transient as well as permanent faults. The design is also suitable for highly dependable systems implemented by means of Field Programmable Gate Arrays (FPGAs) at RTL level. This approach offers the possibility of using larger and more cost-effective devices that contain interconnect defects without compromising on performance or configurability. The algorithm presented here demonstrates the fault tolerance capability of the design and is implemented for a full adder circuit, but it can be generalized to any other digital circuit. Using exhaustive testing, the functioning of all three full adders can be easily verified. When a stuck-at fault occurs, the circuit configures itself to select the other, fault-free outputs. We have evaluated our novel fault tolerant technique (NFT) on five different circuits: full adder, encoder, counter, shift register and microprocessor. The proposed design approach also scales well to larger digital circuits and does not require fault detection. We also present the results of the triple modular redundancy (TMR) method and compare them with our technique. All possible faults are tested by injecting the faults using a multiplexer.
Index Terms - Fault tolerance, fault injection, field programmable gate arrays (FPGA), reconfiguration, triple modular redundancy (TMR), novel fault tolerant technique (NFT).
I. INTRODUCTION
Reliability and performance are two important factors that are becoming a major concern for next-generation very deep sub-micron systems. Their reduced supply voltages, and therefore reduced noise margins, together with their reduced internal capacitances, will dramatically increase their susceptibility and sensitivity to radiation and noise in general, making system failures extremely likely [1], [2]. As a consequence, not only will systems oriented to mission-critical applications (e.g., space, avionics, transport) reinforce the use of fault tolerance, but general-purpose systems implemented in next-generation very deep sub-micron technologies, including FPGAs, will also require some form of fault tolerance [4], [5].
There are two fundamentally different approaches that can be taken to increase the reliability of computing systems. The first approach is called fault prevention (also known as fault intolerance) and the second, fault tolerance. In the traditional fault prevention approach, the objective is to increase reliability by a priori elimination of all faults. Since this is almost impossible to achieve in practice, the goal of fault prevention is to reduce the probability of a system failure to an acceptably low value. In the fault tolerance approach, faults are expected to occur during computation, but their effects are automatically countered by incorporating redundancy (that is, additional resources) so that valid computation can continue even in the presence of faults. These resources may include additional hardware (hardware redundancy), redundant information (information redundancy), additional software (software redundancy), more time (time redundancy), or a combination of all these. They are redundant in the sense that they can be omitted from a system without affecting its normal operation.
Most of the early work in fault-tolerant system design was motivated by aerospace applications, and in particular by the requirement for computers to operate unattended for long periods of time. While this application is still an important one, fault tolerance is now regarded as a desirable, and in some cases essential, feature of a wide range of computing systems, especially in applications where reliability, availability, and safety are of vital importance. For commercial systems, nonredundant (i.e., fault prevention) techniques have been preferred, mainly because a redundant design results in increased overhead in terms of area, power consumption and the like. Reliability is improved by using reliable components, refined interconnections, and so on. However, this approach has limited effectiveness in counteracting hardware faults, and it is inadequate for applications in which system failures once per day or once per week are not acceptable. A fault-tolerant design can provide dramatic improvements in system availability and lead to a substantial reduction in maintenance costs as a consequence of fewer system failures.
To tolerate permanent faults in system hardware, redundancy is the most commonly used approach, though it increases overhead in terms of area, power consumption and the like. The most common form of modular redundancy in practical systems is Triple Modular Redundancy (TMR), which is used for single event upset (SEU) mitigation [6]. The basic concept of triple redundancy is that a sensitive circuit can be hardened against SEUs by implementing three copies of the same circuit and performing a bit-wise ‘majority vote’ on the outputs of the triplicated circuit, as shown in Fig. 1.
The circuit in question can be a mere flip-flop or an entire logic design. The function of the majority voter is to output the logic value (‘1’ or ‘0’) that corresponds to at least two of its inputs. For example, if two or more of the voter’s three inputs are ‘1’, then the output of the voter is ‘1’. If the inputs of the voter are labeled A, B, and C, and the output is labeled V, then the Boolean equation for the voter is: V = AB + AC + BC.
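The voter equation can be checked with a few lines of software. The following is a minimal sketch in Python, not part of the original design flow; the function names `majority_vote` and `voter_truth_table` are hypothetical. Because the operators are bit-wise, the same function also votes on multi-bit words.

```python
from itertools import product

def majority_vote(a: int, b: int, c: int) -> int:
    """Bit-wise majority of three words: V = AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)

def voter_truth_table():
    """Rows (A, B, C, V) for single-bit inputs, in binary counting order."""
    return [(a, b, c, majority_vote(a, b, c))
            for a, b, c in product((0, 1), repeat=3)]

if __name__ == "__main__":
    # Print the eight rows so they can be compared with Table 1.
    for row in voter_truth_table():
        print(*row)
```

The voter output is correct as long as at most one of the three copies is wrong, but a fault inside the voter itself is not masked.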
Fig. 1. Triple modular redundancy with voter (the inputs drive the original circuit and two duplicated circuits, whose outputs feed the voter)
The truth table for the majority voter is shown in Table 1.
TABLE 1
MAJORITY VOTE TRUTH-TABLE
A  B  C  V
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  1
The logic gate representation of the majority voter is shown in Fig. 2.
Fig. 2. Majority voter circuit (inputs A, B, C; output V)
Testing is unlikely to detect the presence of short-lived transient faults [2]. According to field studies, such non-permanent faults are the dominant cause of very large scale integration (VLSI) circuit and system failures (82-98%) [3]. Modeling intermittent and transient faults requires statistical data on their probability of occurrence, which are usually not available.
The algorithm and design presented here deal not only with permanent stuck-at faults but also with transient faults in digital circuits [7]. The typical faults affecting interconnections are breaks, known as opens, and unwanted connections between points, known as shorts. A short between a signal line and ground or power can make the signal remain at a fixed voltage level. Such a fault is logically modeled as the signal being stuck at the corresponding fixed logic value v (0 or 1), and it is denoted by s-a-v, i.e., the line always has the same logic value v, regardless of the inputs that would normally affect it.
We have designed a circuit (NFT), shown in Fig. 3 and based on the algorithm presented here, which replaces the voter circuit of the TMR method. A drawback of TMR is that the voter circuit itself is not 100% reliable. In contrast, our design tolerated every single stuck-at fault injected in our experiments (Table 4).
We also compare the TMR technique with our technique (NFT) in terms of area overhead and performance. The proposed circuit is also capable of tolerating bridging faults, i.e., shorts between two signal lines.
II. ALGORITHM
The algorithm presented in this paper can be generalized to any digital circuit. The circuit under test is denoted as CUT. It is assumed that the design has ‘n’ primary inputs, ‘m’ primary outputs, and a list of combinational circuits. We use the following notation for the main signals, components and circuits:
• The primary inputs are denoted as (i1, i2, i3, ..., in)
• The primary outputs are denoted as (y1, y2, y3, ..., ym)
• The intermediate inputs are denoted as (is1, is2, ..., isa)
• The intermediate outputs are denoted as (os1, os2, ..., osb)
• CUT – Circuit under test
• EOR – Ex-or gate
• PE – Priority encoder
• MUX – Multiplexer
Algorithm single-stuck-at-fault
Inputs : (i1, i2, i3, ..., in) and CUT
Outputs : (y1, y2, y3, ..., ym)
begin
    for CUT do
        create two more copies of CUT and 2*m copies of EOR
        connect the similar outputs of CUT and its copies to the inputs of EORs
    end for
    for all EOR do
        create m copies of PE
        connect similar outputs of EORs to the inputs of PEs
    end for
    for all PE do
        create m copies of MUX
        connect the outputs of PEs to the select lines of MUXs
        connect the similar outputs of CUT and its copies to the input lines of MUXs
    end for
    for all inputs i1 to in do
        check the outputs y1 to ym at MUXs
    end for
end algorithm
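The structure built by the algorithm (three copies of the CUT, 2*m ex-or gates, m priority encoders and m multiplexers) can be sketched behaviorally in Python. This is an illustrative model under stated assumptions, not the authors' VHDL: the names `priority_encoder` and `nft_select` are hypothetical, the ex-or gates are assumed to compare copy 1 with copy 2 and copy 2 with copy 3, and the encoder output when both of its inputs are ‘1’ (not specified in the text) is assumed to be ‘1’.

```python
def priority_encoder(i1: int, i2: int) -> int:
    """Select line for one output bit; input i1 has priority."""
    if i1 == 0:
        return 0   # copies 1 and 2 agree: select copy 1
    if i2 == 0:
        return 1   # copies 2 and 3 agree: select copy 3
    return 1       # assumption: case not specified in the text

def nft_select(copy1, copy2, copy3):
    """Combine the m output bits of the CUT and its two copies.

    Each argument is a sequence of m bits. Per output bit, two ex-or
    gates feed one priority encoder, whose output drives a 2:1 mux
    that chooses between copy 1 and copy 3.
    """
    outputs = []
    for y1, y2, y3 in zip(copy1, copy2, copy3):
        e1 = y1 ^ y2                       # first EOR (assumed pairing)
        e2 = y2 ^ y3                       # second EOR (assumed pairing)
        select = priority_encoder(e1, e2)
        outputs.append(y3 if select else y1)
    return tuple(outputs)

def full_adder(a: int, b: int, ci: int):
    """Example CUT with n = 3 inputs and m = 2 outputs (sum, carry)."""
    return (a ^ b ^ ci, (a & b) | (ci & (a ^ b)))
```

For example, `nft_select((0, 0), (1, 0), (1, 0))` models a sum output of the first copy stuck at ‘0’ while the correct outputs are (1, 0); under the single-fault assumption the mux falls back to the third copy.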
This algorithm can be used to generate a fault tolerant design for any digital circuit. The algorithm is presented for a single fault simulation model, which consists of the following steps:
1. Assign values to the primary inputs, i.e., (i1, i2, i3, ..., in).
2. Choose the interconnect or output of the CUT randomly.
3. Perform logic simulation and record the monitored values.
4. Force the inverse value (inject fault) on the chosen interconnect.
5. Perform the logic simulation again and record the monitored values.
6. Compare the monitored values from steps (3) and (5).
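The six steps above can be sketched as a small fault simulation in Python. This is an illustration under assumptions: the gate-level netlist of the full adder, the net names (`x1`, `g1`, `g2`) and the functions `simulate` and `single_fault_experiment` are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical gate-level full adder: (net, gate, input nets),
# listed in topological order.
NETLIST = [
    ("x1", "xor", ("a", "b")),
    ("s",  "xor", ("x1", "ci")),
    ("g1", "and", ("a", "b")),
    ("g2", "and", ("x1", "ci")),
    ("co", "or",  ("g1", "g2")),
]
GATES = {
    "xor": lambda p, q: p ^ q,
    "and": lambda p, q: p & q,
    "or":  lambda p, q: p | q,
}
MONITORED = ("s", "co")

def simulate(inputs, forced=None):
    """Evaluate the netlist; `forced` maps a net name to an injected value."""
    forced = forced or {}
    values = {name: forced.get(name, bit) for name, bit in inputs.items()}
    for net, gate, ins in NETLIST:
        value = GATES[gate](*(values[i] for i in ins))
        values[net] = forced.get(net, value)
    return values

def single_fault_experiment(inputs, rng=random):
    """Run steps 1 to 6 once; return the chosen net and the outputs that differ."""
    golden = simulate(inputs)                              # steps 1 and 3
    node = rng.choice(list(golden))                        # step 2
    faulty = simulate(inputs, {node: golden[node] ^ 1})    # steps 4 and 5
    differing = [n for n in MONITORED if golden[n] != faulty[n]]  # step 6
    return node, differing
```

Calling `single_fault_experiment({"a": 0, "b": 0, "ci": 1})` repeatedly samples different nets; an empty list of differing outputs means the injected fault was not observable for that input vector.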
III. METHODOLOGY
This paper presents a novel technique to tolerate stuck-at faults at the outputs of any digital circuit under test (CUT). If there is a fault at one of the outputs, the circuit itself detects the fault and configures itself to provide the fault-free output. In place of the voter circuit we use a novel circuit, shown in Fig. 3, consisting of ex-or gates, a priority encoder and a multiplexer, which produces a fault-free output at any moment in time. Under the single-fault assumption, this approach achieves fault tolerance with respect to all the stuck-at faults considered. The idea is implemented for a full adder circuit, as shown in Fig. 4.
Fig. 3. Novel fault tolerant circuit (labeled elements: EOR gates; priority encoder pe with inputs i1, i2 and output o; 2:1 mux with inputs s1, s2; signals A, B, C and V)
Fig. 4. Circuit to tolerate interconnect stuck-at-faults
The circuit under test is triplicated. The similar outputs of the CUT and its copies are fed to the ex-or gates. The similar outputs of the ex-or gates are in turn fed to the priority encoders. The outputs of the encoders are fed to two different multiplexers as select lines. The inputs to these multiplexers are the similar outputs of two of the full adders, i.e., fa1 and fa3. We use D flip-flops (d-ffs) at the outputs of the CUTs to synchronize the outputs. It is assumed that only one fault occurs at a time. All possible faults are tested exhaustively by injecting the faults (0, 1) at all the nodes of the CUT using a multiplexer, as shown in Fig. 5.
Fig. 5. Logic node (a multiplexer with inputs data, ‘1’ and ‘0’, controlled by fault enable, drives the node)
The multiplexer can be inserted at the output net of the circuit to be tested. This technique can be generalized and implemented for tolerating faults in any other circuit.
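A behavioral sketch of such an injection multiplexer is given below in Python. The paper does not give the select encoding of the multiplexer, so the two-parameter interface (`fault_enable`, `stuck_value`) and the function names are assumptions made for illustration.

```python
def logic_node(data: int, fault_enable: bool, stuck_value: int) -> int:
    """Injection mux: pass `data` in normal operation, or force the
    node to `stuck_value` (0 or 1) while `fault_enable` is asserted."""
    return stuck_value if fault_enable else data

def full_adder_sum(a: int, b: int, ci: int,
                   fault_enable: bool = False, stuck_value: int = 0) -> int:
    """Sum output of a full adder with an injection mux on its output net."""
    return logic_node(a ^ b ^ ci, fault_enable, stuck_value)
```

With `fault_enable` deasserted the node behaves normally, so the same netlist can be used for both the fault-free reference run and the fault injection runs.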
IV. TESTING STRATEGY
A three-bit input is applied to the full adder and its copies. The outputs of the full adders (fa1, fa2 and fa3) are s1, s2, s3 and c1, c2, c3, i.e., sum and carry respectively. The system is designed so that it tests the outputs of the full adders; if any of them is stuck at a fault level, the circuit selects the fault-free output through the priority encoder, and this output is finally propagated to the output (sum_o / carry_o) through mux1 / mux2. The truth table of the full adder for exhaustive testing is shown in Table 2.
If the input (a b ci) is “001”, then outputs s1, s2 and s3 are all ‘1’. Since both inputs to each ex-or gate are the same, the ex-or outputs s4 and s5 are ‘0’; these are connected to the inputs of the priority encoder. The priority encoder is designed so that whenever i1/i3 is ‘0’ its output is ‘0’, and whenever i2/i4 is ‘0’ its output is ‘1’. The encoder outputs (o1/o2) are connected to the select lines of mux1/mux2; hence s1 (‘1’) / c1 (‘0’) is propagated to the output sum_o / carry_o through mux1/mux2, which is the correct (fault-free) output. Here priority is given to the i1/i3 inputs. Now assume that s1 is s-a-0. Then the output s4 of the ex-or gate becomes ‘1’ (as s1 = q1 = ‘0’ and s2 = s3 = ‘1’) and i2 is ‘0’; therefore o1, the select signal for mux1, is ‘1’, and hence s3 (‘1’) is propagated to the output sum_o through mux1, giving the fault-free (correct) output.
Now let us consider the input “011”. In this case outputs s1, s2 and s3 are ‘0’, and outputs s4 and s5 of the ex-or gates are also ‘0’. As i1 is ‘0’, the select line of mux1 is ‘0’, making the final sum output (sum_o) ‘0’. Now assume that s1 is s-a-1. Then the output s4 of the ex-or gate becomes ‘1’ (as s1 = q1 = ‘1’ and s2 = s3 = ‘0’) and i2 is ‘0’; therefore o1, the select signal for mux1, is ‘1’, and hence s3 (‘0’) is propagated to the output sum_o through mux1, giving the fault-free (correct) output.
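The two worked examples can be traced in software. The sketch below, in Python, follows the signal names of the sum path (s1 to s5, o1, sum_o); it assumes that s4 compares s1 with s2 and s5 compares s2 with s3, which is consistent with the values quoted in the text but is not stated explicitly, and the function name `trace_sum_path` is hypothetical.

```python
def full_adder(a: int, b: int, ci: int):
    """Return (sum, carry) of a one-bit full adder."""
    return a ^ b ^ ci, (a & b) | (ci & (a ^ b))

def trace_sum_path(a: int, b: int, ci: int, stuck=None):
    """Signal values on the sum path for one input vector.

    `stuck` is None or a pair (name, value) with name in
    {"s1", "s2", "s3"}, modelling a single stuck-at fault.
    """
    s, _ = full_adder(a, b, ci)
    sig = {"s1": s, "s2": s, "s3": s}
    if stuck is not None:
        name, value = stuck
        sig[name] = value
    sig["s4"] = sig["s1"] ^ sig["s2"]      # assumed EOR pairing
    sig["s5"] = sig["s2"] ^ sig["s3"]      # assumed EOR pairing
    # Priority encoder pe1: i1 = s4 has priority over i2 = s5.
    sig["o1"] = 0 if sig["s4"] == 0 else 1
    # mux1: o1 = 0 selects s1 (fa1), o1 = 1 selects s3 (fa3).
    sig["sum_o"] = sig["s3"] if sig["o1"] else sig["s1"]
    return sig
```

Calling `trace_sum_path(0, 0, 1, ("s1", 0))` reproduces the first faulty case and `trace_sum_path(0, 1, 1, ("s1", 1))` the second; the carry path (c1 to c5, o2, carry_o) can be traced in the same way.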
A similar testing and detection procedure is applied using the remaining test vectors, i.e., “000” and “111”, and by inserting faults at c1, c2 or c3 to check the carry output carry_o at the output of mux2.
TABLE 2
TRUTH-TABLE (EXHAUSTIVE TESTING)
Input       Output
a  b  ci    s  co
0  0  0     0  0
0  0  1     1  0
0  1  0     1  0
0  1  1     0  1
1  0  0     1  0
1  0  1     0  1
1  1  0     0  1
1  1  1     1  1
This circuit can also be tested for any stuck-at fault using pseudo-random testing, as shown in Table 3.
TABLE 3
TRUTH-TABLE (PSEUDO-RANDOM TESTING)
Input       Output
a  b  ci    s  co
0  0  0     0  0
0  1  1     0  1
0  0  1     1  0
1  1  1     1  1
TABLE 4
FAULT COVERAGE USING NOVEL FAULT TOLERANT TECHNIQUE (NFT)
Sr. No.  Circuit                No. of injected faults  No. of tolerated faults  Fault tolerance (%)
1.       Full adder             46                      46                       100
2.       8 to 3 encoder         96                      96                       100
3.       3-bit shift register   20                      20                       100
4.       3-bit ripple counter   42                      42                       100
5.       8-bit microprocessor   168                     168                      100
TABLE 5
COMPARISON OF THREE TECHNIQUES IN TERMS OF AREA AND PERFORMANCE FOR FULL ADDER CIRCUIT USING ISE 8.2i
Sr. No.  Fault tolerance technique  Delay (ns)  No. of i/o pads  No. of nets  No. of instances  No. of LUTs (F.G.)  No. of CLBs  No. of i/ps  No. of o/ps  Total equiv. gate count
1.       None                       8.369       05               10           09                02                  01           03           02           12
2.       T.M.R.                     13.364      11               29           22                09                  05           09           02           54
3.       N.F.T.                     9.977       11               28           21                08                  05           09           02           48
We compared three implementations of the full adder circuit (the normal CUT, TMR, and our technique, NFT) in terms of area, performance and power dissipation on Xilinx’s XCV50-6PQ240 FPGA, for permanent as well as transient faults. Power dissipation in each circuit was evaluated using Xilinx’s XPower tool. Table 4 shows the results in terms of fault coverage for the five circuits mentioned earlier. Table 5 shows the results in terms of area and performance for the three implementations of the full adder circuit. Using our technique, the area is reduced somewhat relative to TMR (48 versus 54 equivalent gates in Table 5). In terms of performance, the normal full adder circuit without fault tolerance had a maximum delay of 8.369 ns, the TMR version had a delay of 13.364 ns, and our NFT version had a delay of 9.977 ns, representing an improvement of 25.344% in performance over TMR. The improvement in performance and power dissipation was greater for larger circuits.
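The quoted improvement follows directly from the delays in Table 5, taking the TMR figure as the baseline. A short Python check is given below; the helper name `relative_reduction` is hypothetical, and the gate-count figure is derived here from Table 5 rather than stated in the text.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Values taken from Table 5 (TMR row versus NFT row).
delay_gain = relative_reduction(13.364, 9.977)   # maximum delay in ns
gate_gain = relative_reduction(54, 48)           # total equivalent gate count

if __name__ == "__main__":
    print(f"delay reduction vs. TMR: {delay_gain:.3f} %")
    print(f"gate count reduction vs. TMR: {gate_gain:.1f} %")
```

The delay reduction evaluates to (13.364 - 9.977) / 13.364, which matches the 25.344% quoted above when rounded to three decimal places.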
Table 6 shows the comparison of the three implementations in terms of power dissipation. Total power dissipation using NFT was also lower than with the TMR method. The design is also capable of producing fault-free outputs when a stuck-at fault occurs at the inputs of the priority encoders or multiplexers, or at the outputs of the priority encoders.
TABLE 6
COMPARISON IN TERMS OF POWER DISSIPATION
Sr. No.  Fault tolerance technique  Power dissipation (mW)
1.       None                       127.77
2.       T.M.R.                     135.73
3.       N.F.T.                     129.54
V. SIMULATION AND SYNTHESIS
The fault tolerance ability of the design was tested and verified by means of VHDL simulation. We wrote VHDL code for the top entity and the various components, and used ModelSim SE 6.2e to verify the design. The circuits were synthesized and tested for Xilinx’s XCV50-6PQ240 FPGA. Synthesis was done using the Xilinx synthesis tool (XST) of ISE Foundation series 8.2i.
VI. CONCLUSION
Many techniques have been suggested in the past to detect and tolerate interconnect stuck-at faults. The technique presented in this paper addresses stuck-at faults with a novel algorithm. The methodology can be generalized to other circuits in a similar way. We have successfully implemented and tested this design on Xilinx’s XCV50-6PQ240 FPGA.
REFERENCES
[1] M. K. Stojcev, G. Lj. Djordjevic, and T. R. Stankovic, “Implementation of self-checking two-level combinational logic on FPGA and CPLD circuits”, Microelectronics Reliability, issue 44, 2004, pp. 173-178.
[2] P. K. Lala, “Self-checking and fault-tolerant digital system design”, San Francisco: Morgan Kaufmann Publishers, 2001.
[3] X. Castilo et al., “Boundary-scan test: A practical approach”, Dordrecht: Kluwer Academic Publishers, 1993.
[4] M. Alderighi, S. D’Angelo, C. Metra, and G. R. Sechi, “Novel Fault-Tolerant Adder Design for FPGA-Based Systems”, IEEE Proceedings on On-Line Testing Workshop, Volume 7, 2001, pp. 54-58.
[5] J. Lach, W. H. Mangione-Smith, and M. Potkonjak, “Low Overhead Fault-Tolerant FPGA Systems”, IEEE Trans. on VLSI Systems, 6(2), pp. 212-221, June 1998.
[6] S. Habinc, “Functional Triple Modular Redundancy (FTMR)”, Design and Assessment Report, Gaisler Research, FPGA-003-01, ver. 0.2, pp. 1-55, December 2002.
[7] F. Hanchek and S. Dutt, “Methodologies for Tolerating Cell and Interconnect Faults in FPGAs”, IEEE Transactions on Computers, Vol. 47, pp. 15-33, January 1998.