Error-Detecting Code-Based Minimum Logic of Guaranteed Timing/Soft Error Resilience

JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 17, ISSUE 1, JANUARY 2013

Bao Liu, Lu Wang and Fiona Teshome

Abstract—Nanoscale VLSI systems are subject to increasingly prevalent catastrophic defects, soft errors, and significant parametric variations, which cannot be reduced below certain levels according to quantum physics, and must be handled by new design methods. In this paper, we leverage the existing fault-secure logic design techniques, and propose design methodologies for (1) group-sliced logic (GSL) networks with outputs in group distance-two code for guaranteed single soft error resilience, and (2) inversion-free logic (IFL) networks with outputs in delay-insensitive (DI) code, which clears all timing errors and achieves adaptive maximum performance in the absence of external soft errors at a higher area/power cost compared with the existing logic paradigms. Our experimental results demonstrate that (1) GSL achieves guaranteed single soft error resilience at the cost of an average of 1.63× area, 1.63× critical path delay, and 2.17× power consumption, while DMR achieves an average of 2.12× area, 1.26× critical path delay, and 2.79× power consumption compared with the minimum area design; (2) dual-rail static (Domino) IFL outperforms alternative delay-insensitive (DI) code-based static (Domino) IFL with less area, higher performance and lower power consumption in all test cases, and achieves an average of 2.29(2.41)× performance boost, 2.12(1.91)× layout area and 2.38(2.34)× power consumption compared with the traditional minimum area static logic based on the Nangate 45nm open cell library.

Index Terms—VLSI; Reliability; Performance; Fault Tolerance.


1 INTRODUCTION

As VLSI technology scales into the nanometer domain, VLSI systems are subject to increasingly prevalent catastrophic defects and soft errors. They are further subject to significant parametric variations that cannot be reduced below certain levels at nanometer scale according to quantum physics [3]. Such parametric variations lead to signal propagation delay variations at the component level, which may accumulate and lead to timing errors. As a result, nanoscale VLSI design faces an unprecedented challenge, namely to achieve a reliable system based on unreliable components [5].

The existing fault-tolerant or soft/timing error-resilient VLSI design techniques bear limitations. Asynchronous design is inherently insensitive to performance variation. However, practical asynchronous design is limited due to the lack of an efficient design methodology [18]. BISER design [22,35], Razor logic [14] and other concurrent-checking design techniques [16] provide limited timing/soft error-detecting/correcting capabilities based on dual-module redundancy (DMR) or partial logic duplication. Triple-module redundancy (TMR) [2] and n-module redundancy (NMR) [12] provide the basis for traditional error-correcting fault-tolerant systems. Modern wireless/on-chip communication and nanoscale memory systems [13,25,31] rely on more complex and efficient error detecting/correcting codes (EDCC) [23] for reliability enhancement. EDCC have also been applied in VLSI systems [1,8,15,36].

In this paper, we leverage the existing totally self-checking (TSC) / strongly fault-secure (SFS) logic design techniques [28], and propose design methodologies for (1) group-sliced logic (GSL) networks with outputs in a group distance-two code, and (2) inversion-free logic (IFL) networks with outputs in a delay-insensitive (DI) code for soft/timing error resilience and maximum adaptive performance in nanoscale computing. Our experimental results show that (1) GSL achieves guaranteed single soft error resilience at the cost of an average of 1.63× area, 1.63× critical path delay, and 2.17× power consumption, while DMR achieves an average of 2.12× area, 1.26× critical path delay, and 2.79× power consumption compared with the minimum area design; (2) dual-rail static (Domino) IFL outperforms alternative DI code-based static (Domino) IFL with less area, higher performance and lower power consumption for the large test cases, and achieves an average of 2.29(2.41)× performance boost, 2.12(1.91)× layout area and 2.38(2.34)× power consumption compared with the traditional minimum area static logic based on the Nangate 45nm open cell library.

The rest of the paper is organized as follows. We review the background of the problem and briefly go through the existing techniques in Section 2. We present our preliminary works and the theoretical foundation of the proposed logic styles, namely the strongly fault-secure logic design techniques, in Section 3, before presenting our proposed GSL and IFL design methodologies, their analysis and synthesis flows in Section 4. We evaluate GSL and IFL by experiments in Section 5, and conclude in Section 6.

• B. Liu and L. Wang are with the Electrical and Computer Engineering Department, the University of Texas at San Antonio, San Antonio, TX 78249.
• F. Teshome is with IBM, Tucson, AZ.

2 BACKGROUND

2.1 Nanoscale VLSI Reliability Challenge

Nanoscale VLSI reliability is challenged by a number of phenomena. In the manufacturing process, parametric variations include geometric variations in the lateral (transistor channel length) and vertical (gate oxide thickness) dimensions, dopant fluctuation, mechanical stress, etc. Severe process variations lead to catastrophic defects, e.g., open/short interconnects, transistor oxide breakdown, transistor channel punchthrough, etc., which result in yield loss. Less severe process variations may accumulate over time (through aging) or space (across a chip, i.e., volume defects), and result in system failure.

At system runtime, VLSI parametric variations include on-chip temperature variation, supply voltage degradation, stress, and radiation or cosmic ray strike-induced single-event transients (SETs). Severe runtime parametric variations lead to soft errors, which are transient logic errors during system runtime, including: race/hazard-induced circuit intrinsic glitches, capacitive or inductive interconnect crosstalk noises, radiation or cosmic ray strike-induced single-event upsets (SEUs), and setup/hold time constraint violation-induced timing errors. Less severe system runtime parametric variations may accumulate (e.g., along a signal propagation path), and lead to an error (e.g., a timing error).

Such significant parametric variations have led to increasingly significant performance variability at recent technology nodes, which has severely hindered performance scaling. Further, VLSI systems will be subject to increasingly prevalent defects and soft errors in the long run. Such prevalent defects, soft errors, and significant parametric variations cannot be reduced below certain levels at nanometer scale (e.g., by improving the manufacturing process) due to the uncertainty principle of quantum physics. Consequently, design techniques are much needed for defect, soft error, and performance variation-tolerant nanoscale VLSI circuits.

2.2 Existing Timing/Soft Error-Resilient and Other Fault-Tolerant Techniques

Asynchronous circuits are insensitive to unlimited performance variations, making them an ideal delay-insensitive circuit paradigm in nanometer technologies [24,28,32]. Practical application of asynchronous circuit design has been limited due to the lack of an efficient asynchronous design methodology (e.g., for finite state machines) [18]. Complex asynchronous design and verification may not guarantee function correctness for all possible delay variations and resultant glitches. As a result, globally-asynchronous-locally-synchronous (GALS) design finds increasing applications in practice today.

The first BISER (built-in soft error-resilient) design technique was proposed to enhance soft error immunity for sequential elements [35], wherein a Muller C-element combines the outputs of two duplicate flip-flops. A Muller C-element is a basic building block in asynchronous circuit design, which holds its output unchanged if its inputs differ. When a soft error corrupts one of the two flip-flops, the Muller C-element holds the output unchanged, achieving single soft error resilience.
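The hold behavior described above can be illustrated with a minimal behavioral sketch. The function names and the sample sequences are hypothetical and chosen for illustration only; the model abstracts away all circuit-level timing.

```python
def c_element(a: int, b: int, prev: int) -> int:
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def biser_output(ff1, ff2, init=0):
    """Combine two duplicate flip-flop streams through a C-element."""
    out, trace = init, []
    for a, b in zip(ff1, ff2):
        out = c_element(a, b, out)
        trace.append(out)
    return trace


good = [0, 1, 1, 0, 1]    # fault-free flip-flop contents over five cycles
upset = [0, 1, 0, 0, 1]   # duplicate flip-flop, third sample flipped by a soft error
print(biser_output(good, upset))
```

In this sketch the corrupted third sample makes the two inputs disagree, so the C-element holds its previous value and the output stream matches the fault-free stream.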

To correct soft errors in combinational logic networks, an extension of the technique deploys a Muller C-element that combines the outputs of two latches that sample the combinational logic outputs nonsimultaneously, or simultaneously if the combinational logic network is duplicated along with the latch [22].

A similar technique is Razor logic [14] or Intel error detection sequential (EDS) design [7], wherein each flip-flop is accompanied by a shadow latch. The flip-flop and the shadow latch sample the same combinational logic network output, and their contents are compared. Any mismatch indicates either a timing error or a soft error. Upon error detection, a higher level response mechanism triggers recomputation, e.g., by re-fetching instructions at the architecture level.

BISER design, Razor logic, and other concurrent-checking schemes [16] achieve limited error-detecting/correcting capability based on dual-module redundancy (DMR) or partial duplication. Triple-module redundancy (TMR) [2] and n-module redundancy (NMR) [12] provide the basis for traditional error-correcting fault-tolerant systems. Modern fault-tolerant systems, e.g., modern wireless/on-chip communication [26] and nanoscale memory systems [13,25,31], rely on more efficient error-detecting/correcting codes (EDCC) [23] for reliability enhancement.

EDCC have also been applied in VLSI design. A number of error-detecting codes, including parity code, group parity code [11,30], Hamming code [1,15,36], one-hot code [36], m-out-of-n code and Berger code [4,19], have been applied in self-checking control logic design. AN codes and residue codes have been applied in self-checking arithmetic circuits [8]. In these circuits, error detection triggers a correction mechanism at a higher level, e.g., instruction re-fetching at the architecture level. ECC form the basis of fault-tolerant error-correcting logic. Armstrong proposed forward error-correcting logic in 1961 [1], wherein an error-correcting module intercepts the sequential element outputs, and sends the corrected signals to the next stage. Some of the state-of-the-art implementations can be found in recent publications [15,36].

Alternatively, in backward error-correcting logic, corrected signals are sent back to the sequential elements in the current stage, without necessarily lengthening the timing critical path [20]. However, such ECC-based error-correcting logic is inefficient in correcting soft errors in a combinational logic network, because a single bit soft error in a combinational logic network may propagate along multiple paths and corrupt a number of combinational logic outputs, which may exceed an ECC's error-correcting capacity. For example, a code of minimum Hamming distance d corrects at most ⌊(d − 1)/2⌋ error bits.

3 PRELIMINARIES

3.1 Error Detection-Based Timing/Soft Error-Resilient Logic

A recent publication has proposed error detection rather than error correction for timing/soft error resilience [20]. The motivating observation is that a timing/soft error lasts only for a few nanoseconds. Consequently, an on-the-fly error-detecting scheme which halts the circuit from operation at the occurrence of a timing/soft error suffices in achieving timing/soft error resilience. The circuit resumes operation once the timing/soft error is no longer present. Relying on error detection rather than error correction leads to much reduced hardware overhead.

For example, error detection can be combined with clock gating for timing/soft error resilience (Fig. 1). In this logic paradigm, the combinational logic outputs are encoded in an error-detecting code (EDC). An error-detecting logic module checks the combinational logic outputs and gates the clock. The flip-flops only latch codewords at the logic outputs. If a timing/soft error leads to a non-codeword at the combinational logic outputs, the error-detecting logic blocks the clock, and prevents the flip-flops from latching the incorrect signals. Once the timing/soft error is no longer present, and a codeword appears at the combinational logic outputs, the circuit resumes operation at the earliest moment, achieving minimum performance degradation.

Fig. 1. Error detection and clock gating-based soft error-resilient logic.

A critical problem for such an error detection-based timing/soft error-resilient logic paradigm is as follows. Because the error-detecting logic only detects non-codewords at the combinational logic outputs, we must prevent the combinational logic outputs from undergoing an undetected transition to an incorrect codeword at the occurrence of a timing/soft error. For example, if a single bit soft error at an internal node propagates through the combinational logic network and corrupts two output bits, it would not be detected by a parity checker. If we encode the combinational logic outputs in a Hamming code, we must guarantee that any soft error in the combinational logic network corrupts fewer output bits than the minimum Hamming distance of the code. Similarly, for an error detection-based self-timing logic, we must guarantee that any incomplete combinational logic computation does not lead to an incorrect codeword at the combinational logic outputs.
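The parity example above can be made concrete with a small sketch. The helper names and the 4-bit even-parity word are assumptions for illustration; they are not taken from the paper.

```python
def parity_checker_flags_error(word):
    """Even-parity checker: flags an error when the number of ones is odd."""
    return sum(word) % 2 == 1


def flip(word, positions):
    """Return a copy of word with the bits at the given positions inverted."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(word)]


codeword = [1, 0, 1, 0]          # even number of ones: a legal codeword
one_bit = flip(codeword, {2})    # soft error reaches one output bit
two_bits = flip(codeword, {1, 2})  # the same error reaches two output bits

print(parity_checker_flags_error(one_bit))   # detected
print(parity_checker_flags_error(two_bits))  # silently accepted
```

The two-bit corruption lands on another even-parity word, which is exactly the undetected transition to an incorrect codeword that the logic structure must rule out.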

To solve this problem, we revisit the TSC/SFS logic design techniques in the literature [29].

3.2. Totally Self-Checking (TSC) / Strongly Fault-Secure (SFS) Logic Networks

Definition 1. A logic network G of input code space A and output code space B is fault secure with respect to a fault set F if for all faults f ∈ F and all code inputs a ∈ A, the output is either correct, G(a, f) = G(a, ∅), or is a non-codeword, i.e., beyond the output code space, G(a, f) ∉ B.

Definition 2. A logic network G of input code space A and output code space B is self-testing with respect to a fault set F if for each fault f ∈ F, there is at least one code input that produces a non-codeword output, ∃ a ∈ A | G(a, f) ∉ B.

Definition 3. A logic network G is totally self-checking (TSC) with respect to a fault set F if it is fault secure and self-testing with respect to F.

Definition 4. A logic network G of input code space A and output code space B is strongly fault-secure (SFS) with respect to a fault set F if for all fault sequences < f1, f2, . . . fn >, fi ∈ F and all code inputs a ∈ A, the output is either correct, G(a, < f1, f2, . . . fn >) = G(a, ∅), or is a non-codeword, G(a, < f1, f2, . . . fn >) ∉ B.

The set of strongly fault-secure (SFS) logic networks includes all totally self-checking (TSC) logic networks [29]. Literature [29] further gives two specific types of TSC/SFS logic networks as follows.

3.2.1. b-Byte Distance-Two Code-Based b-Byte Sliced Networks

Definition 5. A b-byte distance-two code is a code in which the codeword bits are in groups of size b, and any two codewords differ in at least two groups.

Definition 6. A logic network is b-byte sliced if the outputs are in groups of size b, and each group is realized by an independent logic network having only the primary inputs in common.

Theorem 1. A b-byte sliced logic network with outputs in a b-byte distance-two code is TSC/SFS with respect to single faults.

For example, the simplest b-byte distance-two code is achieved by duplicating the codeword bits (or including the complements, which forms a dual-rail code), while the simplest b-byte sliced logic network is achieved by duplicating the combinational logic network (or including the complementary logic network) (Fig. 2). Such a logic network is TSC/SFS with respect to any single bit soft error, because any single bit soft error can only corrupt output bits in one of the two independent logic networks. The resultant non-codeword will be detected by a codeword checker, which is a comparator in this case.

3.2.2. DI Code-Based Inversion-Free Networks

Definition 7. A binary vector x covers binary vector y (x ≥ y) if x has a 1 in every position that y has a 1, or y[i] = 1 ⇒ x[i] = 1 ∀i. A code C is unordered, all unidirectional error-detecting (AUED) [Bose and Rao 1982], or delay-insensitive (DI) if no codeword covers any other codeword [Verhoeff 1988].

Examples of unordered/AUED/DI codes include m-out-of-n or m-hot codes, Berger codes, and dual-rail codes [33]. An m-out-of-n or m-hot codeword has a fixed number of logic one bits. A Berger codeword includes two parts: the information bits and the check bits, wherein the check bits are the binary representation of the number of logic zeros in the information bits. An n-bit dual-rail codeword includes n/2 information bits and n/2 complementary bits. Equivalently, it contains n/2 bits of logic one and n/2 bits of logic zero. In summary, an unordered/AUED/DI codeword includes a fixed number of logic one bits (in an m-hot or dual-rail code), or a pre-defined number of logic one bits (in a Berger code).
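The covering relation of Definition 7 and the three example codes can be checked exhaustively for small word lengths. The following is a minimal sketch; the function names are hypothetical, the Berger check width of ⌈log2(n + 1)⌉ bits is an assumption about the encoding detail, and the dual-rail layout (information bits followed by their complements) follows the description above.

```python
from itertools import combinations, product


def covers(x, y):
    """x covers y if x has a 1 wherever y has a 1 (Definition 7)."""
    return all(xi >= yi for xi, yi in zip(x, y))


def is_unordered(code):
    """A code is unordered (DI) if no codeword covers another codeword."""
    return not any(covers(x, y) or covers(y, x) for x, y in combinations(code, 2))


def dual_rail(bits):
    """Information bits followed by their complements."""
    return tuple(bits) + tuple(1 - b for b in bits)


def berger(bits):
    """Information bits followed by the binary count of zeros (MSB first)."""
    width = len(bits).bit_length()   # assumed: enough bits to count 0..n zeros
    zeros = bits.count(0)
    check = tuple((zeros >> i) & 1 for i in reversed(range(width)))
    return tuple(bits) + check


def m_out_of_n(m, n):
    """All n-bit words with exactly m ones."""
    return [w for w in product((0, 1), repeat=n) if sum(w) == m]


info_words = list(product((0, 1), repeat=3))
print(is_unordered([dual_rail(w) for w in info_words]))
print(is_unordered([berger(w) for w in info_words]))
print(is_unordered(m_out_of_n(2, 4)))
print(is_unordered(info_words))   # plain binary words are not unordered
```

The last call is included as a contrast: unencoded binary words contain covering pairs such as 111 and 011, so they cannot serve as a DI code.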

Fig. 2. DMR error detection and clock gating-based soft error-resilient logic.

Definition 8. Unidirectional faults are either all stuck-at-0 or all stuck-at-1 faults.

Theorem 2. Any inversion-free logic network with outputs in a DI code is TSC/SFS with respect to unidirectional faults [29].

This is because the occurrence of unidirectional faults changes the number of logic one bits at the logic outputs, leading to a non-DI-codeword, which is detected. Domino logic is a specific inversion-free logic.

4 GROUP-SLICED LOGIC (GSL)

In this work, we study the following problem:

Problem 1 (Minimum Logic of Single Error Detection): Given a combinational logic network, construct the minimum cost logic network of guaranteed single error detection.

A reduced problem based on Theorem 1 is as follows.

Problem 2 (Minimum Cost Group-Sliced Logic Network): Given a combinational logic network, construct an equivalent minimum cost group-sliced logic network with outputs in a group distance-two code.

A slightly different problem is as follows.

Problem 3 (Minimum Cost Group Distance-Two Code): What is the minimum length of a group distance-two code consisting of at least $2^n$ codewords?

A few well-known linear code bounds include the Hamming bound, which gives the lower bound for this problem, and the Gilbert-Varshamov bound, which gives the upper bound (i.e., existence results) for this problem. We study a more flexible problem wherein bit groups are allowed to have different sizes. We propose a minimum group distance-two code construction method (Algorithm 1).

Algorithm 1. Construct a Group Distance-2 Code
Input: n.
Output: Group distance-2 code consisting of at least $2^n$ codewords.
1. Construct a binary code of length n.
2. Partition the binary codeword bits into k groups $g_i$, where $0 \le i < k$ and $k \ge 2$.
3. Compute the (k+1)-th bit group $g_k = \bigoplus_{0 \le i < k} g_i$, which has $|g_k|$ bits, where $|g_k| = \max_{0 \le i < k} |g_i|$.
4. Concatenate the k + 1 bit groups.

Theorem 2. Algorithm 1 gives a group distance-2 code.

Proof. If two codewords differ in only one group among the first k groups, they must also differ in the (k+1)-th group. As a consequence, they differ in two groups and have a group distance of two. Otherwise, the two codewords differ in at least two groups among the first k groups, and they have a group distance of at least two. In summary, any two codewords constructed by Algorithm 1 have a group distance of at least two. □
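Algorithm 1 and its distance property can be checked by brute force for small n. The sketch below is illustrative only: the function names are hypothetical, and right-aligning shorter groups (zero-padding on the left) before the XOR is an assumption, since the algorithm does not state how groups of unequal size are combined.

```python
from itertools import combinations, product


def group_distance2_code(n, sizes):
    """Algorithm 1 sketch: sizes lists the k information group sizes (sum == n)."""
    assert sum(sizes) == n and len(sizes) >= 2
    width = max(sizes)                      # |g_k| = max |g_i|
    code = []
    for bits in product((0, 1), repeat=n):  # step 1: all binary words of length n
        groups, pos = [], 0
        for s in sizes:                     # step 2: partition into k groups
            groups.append(bits[pos:pos + s])
            pos += s
        check = [0] * width                 # step 3: XOR of the (padded) groups
        for g in groups:
            for j, b in enumerate(g):
                check[width - len(g) + j] ^= b   # assumed right alignment
        code.append(groups + [tuple(check)])     # step 4: concatenate k + 1 groups
    return code


def group_distance(u, v):
    """Number of bit groups in which two codewords differ."""
    return sum(gu != gv for gu, gv in zip(u, v))


code = group_distance2_code(4, [2, 2])
print(len(code), min(group_distance(u, v) for u, v in combinations(code, 2)))
```

For equal group sizes the codeword length is n + n/k, in line with the bound n + ⌈n / k⌉; for unequal sizes the check group simply takes the width of the largest information group.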

For example, Table I gives a group distance-two code, wherein the codeword bits are in three groups of two bits each. The first two groups give a binary code, while the third group is the XOR of the first two groups.

An answer to Problem 3 is as follows.

Corollary 1. The minimum length of a group distance-two code of k + 1 bit groups consisting of at least $2^n$ codewords is no more than n + ⌈n / k⌉.

For all possible group numbers k, the minimum length group distance-two code consisting of at least $2^n$ codewords is the parity code of length n + 1, which includes only one bit in each group. However, Problem 2 is more difficult to solve than Problem 3. Algorithm 1 further helps here.

Corollary 2. The legality of a group distance-two code constructed by Algorithm 1 can be verified by taking the exclusive OR of its bit groups.

Fig. 3. A group-sliced logic network with XOR gates and an OR gate checking the group distance-two code at the logic output and gating the clock.

Corollary 2 provides an efficient codeword legality check method for a group distance-2 code constructed by Algorithm 1. Subsequently, we propose Algorithm 2, which constructs a group-sliced network with the outputs in a group distance-2 code (Fig. 3).

Algorithm 2. Construct a GSL Network with Outputs in a Group Distance-2 Code
Input: Logic network N.
Output: GSL network N′ with outputs in a group distance-two code.
1. Construct a group distance-2 code for the outputs, e.g., by Algorithm 1.
2. Synthesize an independent logic network for each of the k + 1 output bit groups.
3. Include XOR gates which check the output codeword legality and gate the clock.

Theorem 3. A logic stage constructed by Algorithm 2 combined with an ECC or BISER scheme achieves guaranteed logic correctness in the presence of any single soft error.

Proof. (1) Any single soft error in the combinational logic network is detected (Theorem 1). (2) A false positive (reporting an error while there is no error) at the error-detecting and clock-gating logic only delays the function of the circuit. A false negative (reporting no error while there is an error) requires two simultaneous soft errors, which is beyond the scope of our study. (3) Any ECC or BISER scheme achieves guaranteed correction for any single soft error at the sequential elements. □
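The XOR-based legality check of Corollary 2 and the resulting clock gating can be sketched behaviorally as follows. This is a software model under stated assumptions, not the gate-level checker of Fig. 3: the function names are hypothetical, and shorter groups are right-aligned before the XOR, which the paper does not specify.

```python
from itertools import product


def xor_of_groups(groups):
    """Bitwise XOR of all groups, shorter groups right-aligned (assumption)."""
    width = max(len(g) for g in groups)
    acc = [0] * width
    for g in groups:
        for j, bit in enumerate(g):
            acc[width - len(g) + j] ^= bit
    return tuple(acc)


def encode(info_groups):
    """Append the check group, as in step 3 of Algorithm 1."""
    return list(info_groups) + [xor_of_groups(info_groups)]


def clock_enable(groups):
    """XOR gates form the syndrome; an OR of its bits blocks the clock."""
    return not any(xor_of_groups(groups))


def single_group_corruptions(word):
    """Every way a fault confined to one slice can corrupt one bit group."""
    for i, g in enumerate(word):
        for bad in product((0, 1), repeat=len(g)):
            if bad != tuple(g):
                yield word[:i] + [bad] + word[i + 1:]


word = encode([(1, 0), (0, 1)])
print(word, clock_enable(word))
print(any(clock_enable(w) for w in single_group_corruptions(word)))
```

Because each bit group is produced by an independent slice, a single fault can corrupt only one group, and in this model every such corruption leaves a nonzero syndrome and keeps the clock blocked.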

A further knob of optimization is in grouping the binary codeword bits. We try a few groupings for the minimum cost logic network in our experiment.

5 INVERSION-FREE LOGIC (IFL)

We propose to construct Inversion-Free Logic (IFL) networks with outputs in a DI code for timing error resilience and maximum adaptive performance in nanoscale computing. An IFL stage includes:

1) an inversion-free functional combinational logic network with outputs in a DI code,

2) a DI codeword legality checker which checks the DI codewords at the combinational logic outputs, and generates a clock signal for the receiving flip-flops/latches, and

3) the receiving flip-flops/latches.

IFL runs in a sequence of two phases: (1) pre-charge and (2) evaluate in a Domino logic implementation, or (1) reset and (2) compute in a static logic implementation.

Fig. 4. A standard Domino IFL stage, including a standard Domino logic block, which outputs a DI code, and a DI codeword checker, which generates the clock and pre-charge signals.

4.1. Domino Logic

In a Domino logic stage, the DI codeword checker generates a clock signal for the flip-flops/latches, which also provides the pre-charge signal for the dynamic gates (Fig. 4). Domino RAP logic runs as follows.

1) A complete Domino logic computation yields a DI codeword at the logic outputs.

2) The DI codeword legality checker outputs a rising clock edge.

3) The flip-flops/latches sample the inputs.

4) The Domino logic pre-charges.

5) The Domino logic outputs return to zero.

6) The DI codeword legality checker outputs a falling clock edge.

7) The Domino logic evaluates.

Fig. 5. A standard static IFL logic stage, including (1) inverting static logic gates (in place of dynamic logic gates in Domino logic), and (2) static NOR gates with a side input of reset (in place of inverters in Domino logic).

4.2. Static IFL

Domino logic suffers in reliability. At the occurrence of a soft error, Domino logic needs to re-compute, while static logic needs only to wait for the soft error to subside and a valid codeword to appear at the outputs.

Analogous to Domino logic, we propose static IFL. In standard static IFL (analogous to standard Domino logic), each stage includes an inverting logic gate (analogous to a dynamic logic gate in standard Domino logic), and a NOR gate with a side input of reset (analogous to an inverter in standard Domino logic) (Fig. 5). In complex static IFL (analogous to complex Domino logic), each stage includes inverting logic gates (analogous to dynamic logic gates in complex Domino logic), and an additional inverting gate with a side input of reset giving the stage output (analogous to a static logic gate in complex Domino logic). At the stage level, a static IFL network is inversion-free.

Analogous to Domino logic, which operates in a two-phase sequence of pre-charge and evaluate, static IFL runs in a two-phase sequence of reset and compute. During compute, all signals are either stable or rising at the stage level, i.e., static IFL satisfies the monotonicity requirement. Static IFL runs as follows.

1) A complete logic computation yields a DI codeword at the logic outputs.

2) The DI codeword legality checker gives a rising output.

3) The flip-flops/latches sample the inputs.

4) The combinational logic network resets. All the stage outputs return to zero.

5) The combinational logic outputs return to zero.

6) The DI codeword legality checker gives a falling output. The clock returns to zero.

7) The combinational logic network computes.

4.3. Analysis

Theorem 4. In the absence of external soft errors (e.g., noises and SEUs), IFL is self-timing (i.e., it detects logic computation completion).

Proof.

1) Logic computation in an inversion-free logic network (e.g., at the stage level in RAP logic) after reset is monotonic, i.e., all signals are either stable or rising (hence glitch-free), in the absence of external soft errors.

2) Timing errors are unidirectional (stuck-at-0) faults in monotonic (rising) logic computation.

3) An inversion-free logic network with outputs in DI code detects all unidirectional faults (Theorem 2 in Section 3.2.2), including all timing errors in logic computation after reset in the absence of external soft errors. □
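The argument of Theorem 4 can be illustrated for the dual-rail case with a small exhaustive model. This is a sketch under simplifying assumptions: each output is a (true rail, false rail) pair, all rails are zero after reset, and any subset of the rails that are high in the final result may have risen at a given instant. The function names are hypothetical.

```python
from itertools import product


def dual_rail(bits):
    """Encode each output bit as a (true rail, false rail) pair."""
    return [(b, 1 - b) for b in bits]


def is_codeword(pairs):
    """Completion detector: every pair has exactly one rail high."""
    return all(t ^ f for t, f in pairs)


def partial_states(final):
    """All states reachable when any subset of the final high rails has risen.

    Yields (state, complete), where complete is True only for the final state.
    """
    ones = [(i, r) for i, p in enumerate(final) for r in (0, 1) if p[r]]
    for mask in product((0, 1), repeat=len(ones)):
        state = [[0, 0] for _ in final]
        for risen, (i, r) in zip(mask, ones):
            if risen:
                state[i][r] = 1
        yield [tuple(p) for p in state], all(mask)


final = dual_rail([1, 0, 1])
print(all(is_codeword(s) == done for s, done in partial_states(final)))
```

In this model a late signal appears as a rail that is still at zero, so every incomplete state contains an all-zero pair and is rejected; only the completed computation produces a codeword and hence a clock edge.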

In the presence of external soft errors (e.g., noises and SEUs), we have further observations as follows.

1) In the presence of stuck-at-0 external soft errors, IFL is self-timing and error-resilient.

2) In the presence of bidirectional errors, e.g., a stuck-at-1 external soft error and a timing error, the error rate of IFL is given by the probability that an incorrect codeword appears at the logic output.

3) To remove stuck-at-1 external soft errors (e.g., noises and SEUs), we need other techniques, e.g., shielding to remove coupling noises, and enhancing the NMOSFETs in dynamic gates and the PMOSFETs in static gates to resist all stuck-at-1 SEUs.

4.4. IFL Synthesis

Algorithm 3 gives our IFL synthesis algorithm. To prevent logic synthesis from removing the functionally-redundant check logic, we synthesize the DI codeword check logic separately from the function logic.

Algorithm 4 gives an inversion two-colorable logic network for complex static or Domino logic. Applying bubble pushing based on De Morgan's theorem subsequently gives an inversion-free logic network for standard static or Domino logic. If complex logic gates are available, merging two layers of inverting gates to form a non-inverting stage also gives a stage-level inversion-free logic network.

In an inversion two-colorable logic network, each node can be colored in one of two colors, such that nodes across an inverting gate are in different colors, while nodes across a non-inverting gate are in the same color. If a logic network is not inversion two-colorable, it must contain a loop of an odd number of inverting gates, for example, a reconvergent fanout where a signal propagation path of an even number of inverting gates and another signal propagation path of an odd number of inverting gates diverge and then reconverge. We need to break such a loop by including duplicate logic, which brings hardware overhead. Note that each gate needs at most one duplicate, which is in a different color than the original gate. To resolve subsequent color conflicts involving either of the two gates, we only need to wire to the gate of the needed color.

Any Boolean logic can be realized in an inversion-free or inversion two-colorable logic network by Algorithm 4. This is also proven in the literature.

Theorem 5. Any logic can be realized by inversion-free logic if the inputs and the outputs are in DI codes [21].

For example, any logic can be realized in an inversion-free network if all inputs and outputs are given with their complements, i.e., in dual-rail code.

Algorithm 3. IFL Logic Synthesis
Input: Design specifications.
Output: IFL logic stage.
1. Synthesize functional logic with inputs and outputs in DI codes
2. Synthesize and include codeword check logic, which generates the clock signal
3. Transform to an inversion-free or inversion two-colorable logic network by Algorithm 4

Algorithm 4. Generate an Inversion Two-Colorable Logic Network
Input: Logic network N.
Output: Inversion two-colorable network N′ of equal logic.
1. Traverse network N from the outputs;
2. Color each node according to logic inversion;
for each node i of color conflict do
  3. Find or create a duplicate node i′ and its fanin cone;
  4. Re-wire to the duplicate node i′;
  5. Color the original and duplicate nodes i and i′ differently;
end
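The coloring part of Algorithm 4 (steps 1 and 2 plus conflict detection) can be sketched as below. This is a simplified illustration rather than the authors' C implementation: the netlist representation (a dictionary mapping each gate to an inverting flag and a fanin list), the function names, and the example networks are all hypothetical, and the sketch only reports which nodes would need a duplicate instead of creating and re-wiring them.

```python
def polarity_demand(network, outputs):
    """Traverse from the outputs and record which colors each node must take.

    network maps a gate name to (inverting, fanins); names absent from the
    map are treated as primary inputs. Outputs are assigned color 0.
    """
    need = {}
    stack = [(o, 0) for o in outputs]
    while stack:
        node, color = stack.pop()
        if color in need.setdefault(node, set()):
            continue                      # this (node, color) pair was handled
        need[node].add(color)
        inverting, fanins = network.get(node, (False, []))
        for fanin in fanins:
            # Crossing an inverting gate flips the color.
            stack.append((fanin, color ^ int(inverting)))
    return need


def color_conflicts(network, outputs):
    """Nodes demanded in both colors, i.e., candidates for duplication."""
    need = polarity_demand(network, outputs)
    return sorted(n for n, colors in need.items() if len(colors) == 2)


# y = NOR(n1, n2), n2 = INV(n1), n1 = NAND(a, b): n1 reaches y through an
# odd and an even number of inverting gates, so it cannot take one color.
reconvergent = {"n1": (True, ["a", "b"]), "n2": (True, ["n1"]), "y": (True, ["n1", "n2"])}
# y = NAND(n1, n2), n1 = NOR(a, b), n2 = NOR(c, d): two-colorable as is.
tree = {"n1": (True, ["a", "b"]), "n2": (True, ["c", "d"]), "y": (True, ["n1", "n2"])}

print(color_conflicts(reconvergent, ["y"]))
print(color_conflicts(tree, ["y"]))
```

In the reconvergent example the conflict at n1 propagates to its fanin cone, which matches the step of duplicating a node together with its fanin cone; a conflict at a primary input corresponds to requiring that input in both polarities, as in dual-rail code.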

To synthesize dual-rail RAP logic based on a minimum area logic network, we apply Algorithm 4 with all nodes marked of color conflict to duplicate the entire minimum area logic network, before including the dual-rail codeword check logic. Signal transitions in the dual-rail code check logic are monotonically rising and glitch-free during evaluation/computation.

6 EXPERIMENTS

In this section, we evaluate GSL and IFL and compare them with some of the existing logic styles such as minimum area CMOS logic, DMR, and Domino logic in terms of area, timing performance and power consumption. We perform logic synthesis based on Synopsys Design Vision and the 45nm Nangate Open Cell Library [27]. Our test cases include a 10-state FSM (fsm cc8 2.v in [10]), a 16-state FSM (prep4.v in [17]), a 30-state FSM (modified from Control Unit.v in [9]), and a 30/70-state asynchronous up/down counter.

We first evaluate GSL. For each FSM, we perform logic synthesis for a number of finite state encoding schemes as follows.

1) Binary finite state encoding, achieving minimum area with no soft error correction capability.

2) DMR, including a duplicate of the next state logic, a group of XOR gates and an OR gate for identity check at the next state logic outputs, and a clock gate. We include a single copy of the output logic.

3) Group distance-2 finite state encoding, with the next state logic sliced in k groups, a group of XOR gates and an OR gate for codeword legality check, and a clock gate.

Table I gives the layout areas, the critical path delays, and the power consumptions of these FSM implementations (in absolute numbers and in ratios to the minimum area implementations), respectively. Our observations are as follows.

1) Among all the group-sliced logic networks with outputs in a group distance-two code, the minimum logic network area (in addition to the minimum codeword length) is achieved with the group size |gk| = 1 for these 5 FSM test circuits.

2) GSL achieves guaranteed single soft error resilience at an average of 1.63× area, 1.63× critical path delay, and 2.17× power consumption, while DMR incurs an average of 2.12× area, 1.26× critical path delay, and 2.79× power consumption, both compared with the minimum area design for these 5 FSM test circuits.

TABLE 1. AREA (µm2), CRITICAL PATH DELAY (ns), AND POWER CONSUMPTION (µW) (IN ABSOLUTE NUMBERS AND RATIOS TO THE MINIMUM AREA IMPLEMENTATION) OF FSM IMPLEMENTATIONS WITH BINARY, DMR, AND GROUP DISTANCE-TWO FINITE STATE ENCODINGS.

Encoding                      Area (µm2)  Delay (ns)  Power (µW)

10-state FSM (fsm cc8 2.v [10])
  Binary                           81.56        2.60        2.26
  DMR                             166.98        3.26        4.51
  [1,1,1,1,1] dist-2              117.24        3.30        3.17
  [2,2,2] dist-2                  131.30        3.76        3.31

16-state FSM (prep4.v [17])
  Binary                          183.21        2.78        2.12
  DMR                             354.42        3.16        8.17
  [1,1,1,1,1] dist-2              329.90        5.02        6.99
  [2,2,2] dist-2                  356.60        4.86        7.44

30-state counter
  Binary                          124.87        2.46        2.29
  DMR                             257.89        3.22        5.48
  [1,1,1,1,1,1] dist-2            184.41        3.71        4.02
  [2,2,2,2] dist-2                209.70        4.46        4.00
  [2,3,3] dist-2                  208.72        4.38        4.40

30-state FSM (Control Unit.v [9])
  Binary                          178.52        2.59        2.40
  DMR                             316.33        3.37        6.75
  [1,1,1,1,1,1] dist-2            310.24        4.93        6.11
  [2,2,2,2] dist-2                344.51        5.24        7.10
  [2,3,3] dist-2                  324.61        4.60        6.47

70-state counter
  Binary                          194.65        3.13        3.12
  DMR                             544.44        4.14        9.08
  [1,1,1,1,1,1,1,1] dist-2        334.41        5.16        5.72
  [2,2,2,2,2] dist-2              403.15        5.53        5.99
  [3,4,4] dist-2                  456.58        5.67        5.93
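The averages quoted in observation 2 can be recomputed from the absolute numbers in Table 1. The sketch below assumes that each average is the arithmetic mean of the five per-circuit ratios to the binary implementation, and that GSL is represented by the group-size-1 rows; the variable and function names are illustrative.

```python
# (area µm2, delay ns, power µW) per circuit, copied from Table 1.
binary = [(81.56, 2.60, 2.26), (183.21, 2.78, 2.12), (124.87, 2.46, 2.29),
          (178.52, 2.59, 2.40), (194.65, 3.13, 3.12)]
dmr = [(166.98, 3.26, 4.51), (354.42, 3.16, 8.17), (257.89, 3.22, 5.48),
       (316.33, 3.37, 6.75), (544.44, 4.14, 9.08)]
gsl = [(117.24, 3.30, 3.17), (329.90, 5.02, 6.99), (184.41, 3.71, 4.02),
       (310.24, 4.93, 6.11), (334.41, 5.16, 5.72)]  # group-size-1 rows (assumption)

def mean_ratios(rows, base):
    """Mean of per-circuit ratios for area, delay, and power."""
    n = len(rows)
    return tuple(sum(r[k] / b[k] for r, b in zip(rows, base)) / n
                 for k in range(3))

gsl_avg = mean_ratios(gsl, binary)
dmr_avg = mean_ratios(dmr, binary)
print("GSL: %.2fx area, %.2fx delay, %.2fx power" % gsl_avg)
print("DMR: %.2fx area, %.2fx delay, %.2fx power" % dmr_avg)
```

Under these assumptions the means come out at about 1.63×, 1.63×, and 2.17× for GSL and 2.12×, 1.26×, and 2.79× for DMR, in line with the figures stated above.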

Next, we evaluate IFL. We have implemented Algorithm 2 with (or without) bubble pushing in C. By running this C program, we transform a minimum area logic network synthesized by Synopsys Design Vision into an inversion-free (or inversion two-colorable) logic network for the standard (or complex) static/Domino logic. We verify the logic correctness with Synopsys VCS, and evaluate area, timing performance and power consumption with Synopsys Design Vision. We use the following cell libraries.

1) For standard static logic, we use the Nangate 45nm open cell library [27].

2) For complex static logic, we have generated a cell library by merging the static logic Nangate cell library and a resettable static cell library. We generated the resettable static cell library by adding a reset side input to each cell and updating its area and delays based on the closest cell in the Nangate cell library (e.g., a NOR2 in the resettable static cell library is based on a NOR3 in the Nangate cell library). The power consumption estimates are left unchanged.

3) For standard Domino logic, we have generated a footed standard Domino logic cell library based on the Nangate 45nm open cell library. We updated the area of a Domino logic cell according to the transistor count (which drops from 2n to n+2 for a footed inverting dynamic gate), because cell area is largely proportional to the transistor count in the Nangate cell library. We updated the delays of a Domino logic cell as follows. The rising (falling) delays of an inverting (non-inverting) cell are given by those of an inverter (buffer). The falling (rising) delays of an inverting (non-inverting) cell are given by those of the closest cell with an extra NMOSFET (e.g., the falling delays of a NAND2 gate in Domino logic are approximated by those of a NAND3 gate in static logic). The power consumption estimates are left unchanged.

4) For complex Domino logic, we have generated a cell library by merging the standard Domino logic cell library and the static logic Nangate cell library.
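The area update in item 3 can be illustrated with a short worked example. This sketch assumes an n-input inverting static CMOS gate with 2n transistors and a footed dynamic counterpart with n+2 transistors, as stated above, and scales the static cell area by that ratio. The function name and the sample areas are placeholders, not library values.

```python
def footed_domino_area(static_area, n_inputs):
    """Scale a static cell area by the transistor-count ratio (n + 2) / (2n)."""
    if n_inputs < 1:
        raise ValueError("a gate needs at least one input")
    return static_area * (n_inputs + 2) / (2 * n_inputs)

# The ratio is 1 for a 2-input gate and shrinks as the fan-in grows.
for n in (2, 3, 4):
    print(n, footed_domino_area(1.0, n))
```

Under this model a 2-input cell keeps its area (4 transistors either way), while a 4-input cell shrinks to 6/8 of its static area, so the area saving comes from the higher fan-in cells.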

We count only rising signal delays in a standard static/Domino logic network, and only delays of paths that end with a rising signal transition in a complex static/Domino logic network. For timing analysis, we keep an external clock and treat the generated clock as a critical path endpoint.

For each FSM test circuit, we compare 7 logic implementations:

1) The traditional minimum area static logic implementation where finite states are encoded in a binary code.

2) IFL based on dual-rail complex static logic.

3) IFL based on standard static logic with an inversion-free logic network with outputs in an m-hot code.

4) IFL based on complex static logic with an inversion two-colorable logic network with outputs in an m-hot code.

5) IFL based on dual-rail complex Domino logic.

6) IFL based on standard Domino logic with an inversion-free logic network with outputs in an m-hot code.

7) IFL based on complex Domino logic with an inversion two-colorable logic network with outputs in an m-hot code.

Table 2 gives the total layout area (µm2), critical path delay (ns), and power consumption (µW) of these implementations.

The binary encoding static logic implementations are evaluated based on the worst case (the slow cell library) while the IFL implementations are evaluated based on the average case (the typical cell library). This is because the traditional design methodology relies on guardbanding, i.e., the worst case critical path delay needs to be less than the external clock cycle time, while IFL achieves adaptive high performance. Our observations are as follows.

1) Standard (complex) Domino logic achieves less area for all the test cases and higher performance for most of the test cases, with comparable power consumption, compared with standard (complex) static logic. On the other hand, static logic achieves enhanced reliability compared with Domino logic: at a soft error occurrence, static logic needs only to hold, while Domino logic needs to re-compute.

2) Complex static (Domino) logic achieves less area, higher performance and comparable power consumption for all the test cases compared with standard static (Domino) logic.

3) While m-hot complex static (Domino) IFL achieves the minimum area for the two smallest test cases, dual-rail static (Domino) IFL achieves the minimum area, the maximum performance and the minimum power consumption for the two largest test cases. This is partly because (1) m-hot finite state encoding requires a larger area than binary finite state encoding for certain designs, e.g., a counter, and (2) constructing an inversion-free logic network needs to duplicate more gates for a larger and more complex logic network. On average, dual-rail static (Domino) IFL achieves 2.29(2.41)× performance boost, 2.12(1.91)× layout area and 2.38(2.34)× power consumption, while m-hot complex static (Domino) IFL achieves 2.19(2.22)× performance boost, 1.98(1.94)× layout area and 4.73(4.75)× power consumption compared with the traditional minimum area static logic synchronous design for the 5 FSMs based on the Nangate 45nm open cell library.

4) Dual-rail static/Domino logic provides further enhanced soft error resilience. Recall that dual-rail code is not only a DI code, but also a b-byte distance-two code. Although a dual-rail static/Domino logic network is not a b-byte sliced logic network [29], it is a symmetrically intertwined logic network. A single error cannot alter an internal logic function and its complement at the same time. As a result, a dual-rail static/Domino logic network detects all single errors and the majority of multiple errors.

5) Besides achieving performance boost and reliability enhancement, these logic paradigms can also lead to power reduction by scaling down the power supply voltage and trading performance for power consumption reduction. The achieved reliability enhancement enables such power supply voltage scaling.

6) IFL clears all timing errors in the absence of external soft errors, albeit at a higher area/power cost compared with Razor logic.
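Observation 4 relies on the dual-rail code having distance two: every bit is carried on a pair of rails, and after evaluation a codeword is legal only when the two rails of each pair differ. The sketch below, with hypothetical helper names, encodes a word, flips each rail in turn, and checks that every single-rail error produces an illegal codeword. It models only the code at the network outputs, not the internal symmetrically intertwined logic.

```python
def dual_rail_encode(bits):
    """Map each bit b to the rail pair (b, not b)."""
    rails = []
    for b in bits:
        rails += [b, 1 - b]
    return rails

def is_legal(rails):
    """Legal after evaluation: the two rails of every pair differ."""
    return all(rails[i] != rails[i + 1] for i in range(0, len(rails), 2))

def single_errors_detected(bits):
    """True if flipping any one rail always yields an illegal codeword."""
    word = dual_rail_encode(bits)
    for i in range(len(word)):
        corrupted = list(word)
        corrupted[i] ^= 1
        if is_legal(corrupted):
            return False
    return True

print(single_errors_detected([1, 0, 1, 1]))
```

A double error that flips both rails of the same pair maps one legal codeword to another, which is why the text claims detection of all single errors but only the majority of multiple errors.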

TABLE 2. COMPARISON OF TRADITIONAL STATIC LOGIC AND IFL LOGIC DESIGNS IN TOTAL LAYOUT AREA (µm2), CRITICAL PATH DELAY (ns), AND TOTAL POWER CONSUMPTION (µW) FOR 5 FSMS BASED ON THE NANGATE 45nm OPEN CELL LIBRARY.

Implementation                 #FF  Area (µm2)  Delay (ns)  Power (µW)

10-state FSM (fsm cc8 2.v [10])
  Binary Static                  4       81.56        2.60        1.60
  Dual-Rail Static IFL           4      167.52        1.02        3.20
  2-Hot Standard Static IFL      5      269.78        1.28        5.14
  2-Hot Complex Static IFL       5      242.47        1.24        4.24
  Dual-Rail Domino IFL           4      150.76        1.01        3.13
  2-Hot Standard Domino IFL      5      263.84        1.26        4.76
  2-Hot Complex Domino IFL       5      216.67        1.22        4.22

16-state FSM (prep4.v [17])
  Binary Static                  4      183.21        2.78        2.12
  Dual-Rail Static IFL           4      390.14        1.01        6.76
  3-Hot Standard Static IFL      6      566.88        1.27        9.58
  3-Hot Complex Static IFL       6      507.58        1.18        8.56
  Dual-Rail Domino IFL           4      350.75        0.93        6.71
  3-Hot Standard Domino IFL      6      523.79        1.25        9.38
  3-Hot Complex Domino IFL       6      453.05        1.15        8.37

30-state counter
  Binary Static                  5      124.87        2.46        2.29
  Dual-Rail Static IFL           5      254.80        1.17        4.57
  3-Hot Standard Static IFL      7      517.77        1.68        7.91
  3-Hot Complex Static IFL       7      498.48        1.48        7.42
  Dual-Rail Domino IFL           5      230.06        1.15        4.50
  3-Hot Standard Domino IFL      7      457.53        1.55        7.69
  3-Hot Complex Domino IFL       7      425.60        1.32        6.83

30-state FSM (Control Unit.v [9])
  Binary Static                  5      178.52        2.59        2.40
  Dual-Rail Static IFL           5      395.67        1.38        5.27
  3-Hot Standard Static IFL      7      473.17        1.59       15.58
  3-Hot Complex Static IFL       7      428.90        1.40       12.87
  Dual-Rail Domino IFL           5      354.17        1.26        5.15
  3-Hot Standard Domino IFL      7      467.35        1.46       15.57
  3-Hot Complex Domino IFL       7      422.25        1.42       12.97

70-state counter
  Binary Static                  7      194.65        3.13        3.12
  Dual-Rail Static IFL           7      421.13        1.45        7.84
  4-Hot Standard Static IFL      8      480.62        1.95       15.85
  4-Hot Complex Static IFL       8      435.32        1.66       16.01
  Dual-Rail Domino IFL           7      379.37        1.36        7.67
  4-Hot Standard Domino IFL      8      460.85        1.60       15.85
  4-Hot Complex Domino IFL       8      425.51        1.57       16.06
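The dual-rail averages quoted in observation 3 can be recomputed from the Binary Static and Dual-Rail rows of Table 2. As a sketch, assuming the averages are arithmetic means of per-circuit ratios, with the performance boost taken as binary delay divided by IFL delay (the names below are illustrative):

```python
# (area µm2, delay ns, power µW) per circuit, copied from Table 2.
binary_static = [(81.56, 2.60, 1.60), (183.21, 2.78, 2.12), (124.87, 2.46, 2.29),
                 (178.52, 2.59, 2.40), (194.65, 3.13, 3.12)]
dual_rail_static = [(167.52, 1.02, 3.20), (390.14, 1.01, 6.76), (254.80, 1.17, 4.57),
                    (395.67, 1.38, 5.27), (421.13, 1.45, 7.84)]
dual_rail_domino = [(150.76, 1.01, 3.13), (350.75, 0.93, 6.71), (230.06, 1.15, 4.50),
                    (354.17, 1.26, 5.15), (379.37, 1.36, 7.67)]

def summarize(ifl, base):
    """Return mean (performance boost, area ratio, power ratio) over circuits."""
    n = len(base)
    boost = sum(b[1] / r[1] for r, b in zip(ifl, base)) / n   # delay: base / IFL
    area = sum(r[0] / b[0] for r, b in zip(ifl, base)) / n
    power = sum(r[2] / b[2] for r, b in zip(ifl, base)) / n
    return boost, area, power

static_avg = summarize(dual_rail_static, binary_static)
domino_avg = summarize(dual_rail_domino, binary_static)
print("static: %.2fx boost, %.2fx area, %.2fx power" % static_avg)
print("domino: %.2fx boost, %.2fx area, %.2fx power" % domino_avg)
```

Under these assumptions the dual-rail means agree with the 2.29(2.41)×, 2.12(1.91)×, and 2.38(2.34)× figures stated in observation 3.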

7 CONCLUSION

Our contributions in this work are as follows.

1) We achieve minimum logic networks of guaranteed single soft error resilience. We propose two construction methods for minimum group distance-two code, and group-sliced logic (GSL) networks with outputs in a group distance-two code, respectively. We achieve guaranteed single soft error resilient logic networks of an average of 1.63× area, 1.63× critical path delay, and 2.17× power consumption, while DMR achieves an average of 2.12× area, 1.26× critical path delay, and 2.79× power consumption compared with the minimum area design for the 5 FSM test circuits.

2) We propose IFL, which provides a reliable high performance nanoscale computing paradigm by achieving adaptive high performance and soft error resilience in the presence of performance variability and soft errors. IFL is easy to implement in the existing VLSI design methodologies. An IFL stage forms a synchronous system, or a synchronous module in a GALS system. We expect continuing VLSI performance scaling with the proposed soft/timing error-resilient VLSI design methodologies.

ACKNOWLEDGMENT

This work was supported in part by NSF under grant CCF-1117975.

REFERENCES

[1] D. B. Armstrong, “A General Method of Applying Error Correction to Synchronous Digital Systems,” The Bell System Technical Journal, 40(2), pp. 557-593, 1961.

[2] A. Avizienis, H. Kopetz, and J.-C. Laprie, “The evolution of fault-tolerant computing,” Dependable Computing and Fault Tolerant Systems, 1, 1987.

[3] R. Blish, T. Dellin, S. Huber et al., “Critical Reliability Challenges for the International Technology Roadmap for Semiconductors (ITRS),” International SEMATECH, Technology Transfer No. 03024377A-TR, 2003.

[4] C. Bolchini and D. Sciuto, “An output/state encoding for self-checking finite state machine,” In Proc. IEEE Intl. Symp. Circuits and Systems, pp. 2136–2139, 1995.

[5] S. Borkar, “Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation,” IEEE Micro, pp. 10-16, 2005.

[6] B. Bose and T. R. N. Rao, “Theory of unidirectional error correcting/detecting codes,” IEEE Trans. Computers, C-31(6):521–530, 1982.


[7] K. A. Bowman, J. W. Tschanz, S.-L. L. Lu, P. A. Aseron, M. M. Khellah, A. Raychowdhury, B. M. Geuskens, C. Tokunaga, C. B. Wilkerson, T. Karnik, and V. K. De, “A 45nm resilient microprocessor core for dynamic variation tolerance,” IEEE J. Solid State Circuits, 46(1):194–208, 2011.

[8] T. J. Brosnan and N. R. Strader II, “Modular error detection for bit-serial multiplication,” IEEE Trans. Computers, 37(9):1043–1052, 1988.

[9] M. D. Ciletti, Advanced Digital Design with the Verilog HDL, Prentice Hall, 2003.

[10] C. E. Cummings, “The fundamentals of efficient synthesizable finite state machine design using NC-Verilog and BuildGates,” In Proc. International Cadence Usergroup Conference, 2002.

[11] K. De, C. Natarajan, D. Nair, and P. Banerjee, “Rsyn: A system for automated synthesis of reliable multilevel circuits,” IEEE Trans. VLSI Systems, 2:186–195, 1994.

[12] P. G. Depledge, “Fault-tolerant computer systems,” IEE Proc. A, 128(4):257–272, 1981.

[13] S. A. Elkind and D. P. Siewiorek, “Reliability and performance of error correcting memory and register arrays,” IEEE Trans. Computers, C-29(10):920–927, 1980.

[14] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim, and K. Flautner, “Razor: Circuit-level correction of timing errors for low-power operation,” IEEE MICRO special issue on Top Picks From Microarchitecture Conferences of 2004, 24(6):10–20, 2004.

[15] R. Garg, P. Mathews, and D. Zacher, “Synthesis of fault tolerant circuits for FSMs and RAMs,” In Proc. MAPLD International Conference, 2009.

[16] M. Goessel, V. Ocheretny, E. Sogomonyan, and D. Marienfeld. New Methods of Concurrent Checking, Springer, 2008.

[17] S. Golson. “State machine design techniques for Verilog and VHDL,” In Synopsys User Group Conference (SNUG), 1994.

[18] S. Hauck, “Asynchronous design methodologies: An overview,” Proceedings of the IEEE, 83(1):69–93, 1995.

[19] N. K. Jha and S. J. Wang, “Design and synthesis of self-checking VLSI circuits,” IEEE Trans. Computer-Aided Design, 12:878–887, 1993.

[20] B. Liu, “Error-detecting/correcting-code based robust nanoelectronic circuits,” In Proc. NASA/ESA Conference on Adaptive Hardware and Systems, 2010.

[21] G. Mago, “Monotone functions in sequential circuits,” IEEE Trans. Computers, C-22:928–933, 1973.

[22] S. Mitra, M. Zhang, N. Seifert, T. M. Mak, and K. S. Kim, “Soft error resilient system design through error correction,” In IFIP VLSI-SoC, 2006.

[23] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms, Wiley-Interscience, 2005.

[24] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” In Proc. International Symposium on the Theory of Switching, pp. 204–243, 1959.

[25] K. Pagiamtzis, N. Azizi, and F. N. Najm, “A soft-error tolerant content-addressable memory (CAM) using an error-correcting-match scheme,” In Proc. Custom Integrated Circuits Conf., pp. 301–304, 2006.

[26] C. B. Schlegel and M. A. Herro, “A burst-error-correcting Viterbi algorithm,” IEEE Trans. Communications, 38(3):285–291, 1990.

[27] Silicon Integration Initiative (SI2), Nangate Open Cell Library. www.si2.org/openeda.si2.org/projects/nangateli.

[28] M. Singh and S. M. Nowick, “MOUSETRAP: Ultra-high-speed transition signaling asynchronous pipelines,” 15(6):684–698, 2007.

[29] J. E. Smith and G. Metze, “Strongly fault secure logic networks,” IEEE Trans. Computers, 27(6):491–499, 1978.

[30] E. S. Sogomonyan, “Design of built-in self-checking monitoring circuits for combinational devices,” Automation and Remote Control, 35(2):280–289, 1974.

[31] F. Sun, S. Devarajan, K. Rose, and T. Zhang, “Multilevel flash memory on-chip error correction based on trellis coded modulation,” 2006.

[32] I. E. Sutherland, “Micropipelines,” Communications of the ACM, 32(6):720–738, 1989.

[33] T. Verhoeff, “Delay-insensitive codes – an overview,” Distributed Computing, 3:1–8, 1988.

[34] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th Edition, Addison-Wesley, 2011.

[35] M. Zhang, S. Mitra, et al., “Sequential element design with built-in soft error resilience,” IEEE Trans. VLSI Systems, 14(12):1368–1378, 2006.

[36] J. Zheng, S. Katanyoutanant, and M. Le, “Safe and efficient one-hot state machine,” In Proc. MAPLD International Conference, 2005.

B. Liu (BS’93, MS’96, Ph.D.’03) is an assistant professor at the University of Texas at San Antonio. He has served as co-chair for the Emerging/Innovative Process & Device Technologies and Design Issues (EDT) session in International Symposium on Quality Electronic Design (ISQED) since 2006, co-chair of the Photovoltaics Technology session in Asia Symposium on Quality Electronic Design (ASQED) since 2010, and invited session chair for Emerging Nano-Circuits and Systems in IEEE International Midwest Symposium on Circuits and Systems (MWSCAS) in 2010. Dr. Liu is the recipient of a Best Paper Award in International Conference on Computer Design in 2005, and a Best Research Award in UCSD Research Review 2002. He has published over 50 journal articles and conference papers. His research interests include VLSI variability and reliability analysis, robust, high performance and low power design, nanoelectronic architecture, and emerging technologies. Dr. Liu is a senior member of IEEE.

L. Wang (BS’09, MS’11) is currently pursuing her Ph.D. degree at the University of Texas at San Antonio. Her research interests include VLSI statistical timing analysis, delay test, and performance optimization. She is a student member of the IEEE.

F. Teshome (BS’09, MS’11) currently works with IBM at Tucson, AZ. She is a member of the IEEE.
