16725463

Embed Size (px)

Citation preview

  • 7/28/2019 16725463

    1/4

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 1, MARCH 1998 173

    A Novel Design of a Two Operand Normalization Circuit

    Elisardo Antelo, Montserrat Boo, Javier D. Bruguera,

    and Emilio L. Zapata

    AbstractThis paper presents a new design for two operand normaliza-tion. The two operand normalization operation involves the normalizationof at least one of two operands by left shifting both by the same amount.

    Our design performs the computation of the shift by making an OR ofthe bits of both operands in a tree network, encoding the position of

    the first nonzero bit. The encoded position is obtained most significantbit first, and then there is an overlapping with the shifting operation.The design we propose replaces two leading zero detector circuits and acomparator, that are present in the conventional approach. Our schemedemonstrates to be more area efficient than the conventional one. Thecircuit we propose is useful in floating point complex multiplication andCOordinate Rotation DIgital Computer (CORDIC) processors.

    Index TermsDigital VLSI design, floating point operations, leadingzero detector circuit, normalization.

    I. INTRODUCTION

    There are many applications where a two operand normalization

    step is needed [3], [5], [7]. This implies the shift of the two

    operands until one of them is normalized in a certain range (i.e.,

    [1, 2) for IEEE floating point arithmetic). Typical applications that

    require this operation are floating point operations over complex

    numbers like multiplication [7] or addition/subtraction, fixed-point

    CORDIC (COordinate Rotation DIgital Computer) arithmetic for

    angle calculation in SVD (singular value decomposition algorithm)

    [4], [5], and the normalization of one operand expressed in carry-save

    redundant arithmetic, useful in arithmetic coders [1] and to detect the

    sign of a carry-save operand.

    Conventional two operand normalization requires leading zero

    detection, comparison and shifting. In this paper, we present an

    area efficient design for the two operand normalization problem

    based on the encoding of the position of the first nonzero bit from

    the most significant side of both operands by means of a tree

    network of OR operations. As the position of the first nonzero bit

    is obtained most significant bit first, there is an overlapping with

    the shift of the operands. Our scheme avoids the comparison of

    the number of leading zeros of both operands, resulting in a more

    area efficient implementation maintaining the speed. The structure

    of the paper is as follows. Section II reviews the conventional two

    operand normalization. In Section III, we propose the new scheme. In

    Section IV, we compare both the conventional and the new design.

    Section V presents two applications, and finally Section VI reports

    some conclusions.

    II. CONVENTIONAL TWO OPERAND NORMALIZATION

    Two operand normalization implies to left shift both operands in

    such a way that at least one of them has a one (or a zero if the

    Manuscript received April 15, 1995; revised April 15, 1996. This work wassupported in part by the Ministry of Education and Science (CICYT) of Spainunder contract TIC96-1125.

    E. Antelo, M. Boo, and J. D. Bruguera are with the Department Electronicae Computacion of the University of Santiago de Compostela, Santiago deCompostela, Spain.

    E. L. Zapata is with the Department Arquitectura de Computadores,University of Malaga, Malaga, Spain.

    Publisher Item Identifier S 1063-8210(98)01314-6.

    Fig. 1. Conventional two operand normalization.

    numbers are negative twos complement) in the most significant bit

    position. For a low complexity implementation the left shift can

    be done in several cycles making a single left shift in each cycle,

    and then, it is only necessary to check the most significant bit of

    both operands in each cycle. This scheme has a very low hardware

    complexity but requiresO ( n )

    cycles for normalization (wheren

    is

    the wordlength of the operands) and there is a variable number of

    cycles for normalization (data dependent processing) leading to a

    more complex control of the whole system. In this paper we consider

    a more efficient architecture that produces the result in one cycle.

    Fig. 1 shows a fast architecture for the two operand normalization, in

    one cycle, of two floating point operands with 24 bit significants and

    8 bits for the common exponent (this scheme is used, for example,

    in floating point multiplication of complex numbers [7]). Although

    this is a particular implementation, it is easy to generalize it to any

    number of bits.

    Two leading zero detector (LZD) circuits [6] encode the position of

    the first non zero bit from the most significant side in each operand.

    Then, by means of a comparator the smallest of the encoded values is

    determined. The position of the first non zero bit is exactly the shift

    that has to be performed to normalize at least one of the operands.

    The encoded position has a binary expressionS = d

    4

    d

    3

    d

    2

    d

    1

    d

    0

    . This

    10638210/98$10.00 1998 IEEE

  • 7/28/2019 16725463

    2/4

    174 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 1, MARCH 1998

    Fig. 2. New algorithm for two operand normalization.

    number is used to left shift both operands and it is also subtracted

    from the exponent. The binary value ofS

    hasl o g

    2

    ( p )

    bits in a

    general case, wherep

    is defined asp = 2

    d l o g ( n ) e forn

    bit operands.

    Observe thatp

    is the largest integer power of two greater or equal to

    n

    , and thereforep

    is always even.

    The left shift is usually performed by means of a barrel shifter

    to normalize the operands in one cycle [9]. The barrel shifter that

    we consider in this paper has to allow hardwired left shifts by

    p = 2 ; p = 4 ; 1 1 1 ; 1 in different stages (multiplexers). In Fig. 1, we haven = 2 4

    andp = 3 2

    . After the normalization process, a rounding

    of the operands must be carried out for floating point arithmetic.

    Moreover, hardware for overflow/underflow detection in floating

    point arithmetic is needed (not shown in the figure).

    As shown in Fig. 1, if the significants are expressed in twos

    complement, a module is necessary to perform the ones complement

    of the operands when the sign bit is one, to encode the position of

    the first zero. This module is not necessary in the case of sign and

    magnitude representation for the significants, like the IEEE standard

    for floating point arithmetic. The scheme showed in Fig. 1 can be

    easily adapted to deal with integer values, both unsigned (eliminating

    the module for the ones complement), and signed twos complement.

    For this cases the hardware for the exponent and rounding are

    eliminated.

    III. A NEW ALGORITHM FOR THE TWO OPERAND NORMALIZATION

    In this section we describe the new algorithm for two operand

    normalization. We present an scheme that replaces the two LZD

    circuits and the comparator present in the conventional architecture.

    For the sake of simplicity we develop the algorithm for the two

    operand normalization of integer unsigned numbers, but it can be

    generalized for twos complement signed integers, and floating point

    numbers, just replacing the two LZD circuits and the comparator in

    Fig. 1 by the new scheme.

    The basic operation of our algorithm is to encode the position

    of the first non zero bit from the most significant side of both

    operands by performing an OR operation over the two operands

    in a tree network. The encoded position is produced msb (most

    significant bit) first, so the shifting operation begins when the most

    significant bit of the encoded position is available, and then there is an

    overlapping between the shift computation and the shifting operation.

    The algorithm we present has a binary tree structure where a newd

    i

    (beginning with the msb ofS

    ) is obtained in each decision of the tree.

    Fig. 2 shows the binary tree that represents our algorithm for the two

    operand normalization. In this figure we present the algorithm for 16bit operands, but the structure of the algorithm can be extended easily

    to any number of bits (not necessary a power of two). Note that in

    the case of a non power of two wordlength of the operands, the tree

    would not be symmetrical. We also show in this figure, an example

    for the normalization of two operands. Note the path followed in the

    tree for this case to find the suitable shift. For each decision in the

    tree we show the result of the OR operation over certain bits, and

    the suitable value ford

    i

    .

    The encoding of the position of the first non zero bit is as follows.

    First we check thep = 2

    (OR operation over 8 bits for p = 1 6 ) more

    significant bits of both operands. If all the bits are zero, then the

    position of the first non zero bit in one of the operands is greater

    thanp = 2

    and thend

    3

    = 1

    (that is a left shift by2

    p = 2 have to be

    done). Else d 3 = 0 since the position of the first non zero bit iswithin the

    p = 2

    most significant bits. To determine the value ofd

    2

    ,

    p = 4

    bits of each operand have to be checked. The position of these

    p = 4

    bits is a function ofd

    3

    . Ifd

    3

    = 0

    , then thep = 4

    msb have to

    be inspected. Else, thep = 4

    bits following thep = 2

    msb have to be

    checked. The remaining bits of the encoded position are determined

    following a similar procedure untild

    0

    is obtained. In the example

    shown in Fig. 2 we have drawn an square over the bits inspected in

    each case.

    The algorithm we propose can be mapped efficiently in a combi-

    natorial network. The normalization architecture based on the new

    algorithm is depicted in Fig. 3 forn = 1 6

    (p = 1 6

    ). We distinguish

    two parts, one corresponding to the shift calculation and another to the

  • 7/28/2019 16725463

    3/4

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 1, MARCH 1998 175

    TABLE ICOMPARISON BETWEEN CONVENTIONAL AND NEW SCHEME

    Fig. 3. Hardware implementation of the new algorithm.

    barrel shifters that perform the shift. For the calculation of the shift

    we implement the OR of all possible groups of bits shown in Fig. 2,

    and the correct path in the tree is selected by means of multiplexers

    controlled by the value obtained ford

    i

    in each step. To reduce the

    hardware complexity of the algorithm, bitd

    0

    is computed by checking

    the most significant bit of both operands after thel o g

    2

    ( p ) 0 1

    initial

    shifts in the barrel. For a general case, the circuit is made up of a

    tree of OR gates withl o g

    2

    ( p )

    levels. There arel o g

    2

    ( p ) 0 2

    levelsof multiplexers.

    IV. COMPARISON

    In this section we perform a comparison of our design and the

    conventional design described in section II. We have made the

    comparison based on two different approaches. On one hand we have

    performed a VLSI implementation in 0.7

    m CMOS double metal

    technology of both schemes. On the other hand we have evaluated

    the architectures based on a model independent of the technology. In

    this model the hardware complexity is evaluated counting the gates

    needed for the implementation and the delay is computed in gate

    levels. For these comparisons we have considered the two operand

    normalization of unsigned integer numbers withn = 3 2

    .

    We now give the components of the critical path of both designs.

    We have introduced buffers in the lines that control the multiplexers

    of the barrel shifters due to its large fanout. The critical path in the

    conventional design is

    T

    c o n v

    = t

    l z d

    + t

    c o m p

    + t

    b u e r

    + t

    b 0 s h

    (1)

    wheret

    l z d

    is the delay of an LZD circuit,t

    c o m p

    is the delay of a

    comparator,t

    b u e r

    is the delay of a buffer andt

    b 0 s h

    the delay of a

    barrel shifter. For the case of our architecture we have

    T

    n e w

    = t

    l 0 o r

    + t

    l 0 m u x

    + t

    b u e r

    + t

    m u x

    + t

    o r

    + t

    b u e r

    + t

    m u x

    (2)

    wheret

    l 0 o r

    is the delay ofl o g

    2

    ( p )

    OR gate levels,t

    l 0 m u x

    is the

    delay ofl o g

    2

    ( p ) 0 2

    2-to-1 multiplexer levels,t

    m u x

    is the delay

    of a 2-to-1 multiplexer, andt

    o r

    is the delay of a two input OR

    gate. Note thatt

    b 0 s h

    (delay of the barrel shifter) does not appear

    in the expression of the critical path due to the overlapping with the

    computation of thed

    i

    bits.

    Table I shows the area and delay obtained for the VLSI imple-

    mentation of the conventional and new design. We have performed

    a standard-cells based implementation (ES2 ECPD07 [2]) with the

    DFWII (Cadence) design tool. As we can see both schemes have

    similar speed, but the area devoted to the shift computation in the

    conventional design is 2.91 times the area devoted to the same

    function in our design, and the total area, including the barrel shifters,

    has been reduced by a factor 1.17 with respect to the conventional

    approach.

    For the technology independent evaluation we have assumed that

    anx

    -input gate (nand, nor) has a delay ofx = 2

    and complexity ofx = 2

    gates [8]. For the inverters we have considered 0.5 gate delays and

    0.5 gates of complexity. For the buffers we have assumed a delay of

    1 + m = 8

    gate delays, wherem

    is the number of cells driven by the

    buffer. We have considered a complexity of 5 gates for the buffer.

    These are reasonable assumptions for many technologies, although

    there are many other alternatives when considering a particular

    technology. Table I shows the result of this evaluation. The number

    of gate levels are similar, but the gate count for the shift computation

    is reduced by factor 2.8 in favor of our design.

    Although one of the comparisons is technology dependent and the

    other one is very rough, the results obtained indicate that the new

    design reduces the area requirements (or gate count) maintaining the

    same speed as the conventional one. Observe the significant saving

    in area (or gates) in the shift computation part.

    V. APPLICATIONS

    In this section we briefly discuss two interesting applications for

    the scheme we propose, although many others are possible.

    1) Floating Point Multiplication of Complex Numbers: In digital

    signal processing (DSP), many signals are efficiently repre-

    sented as complex numbers. To increase the dynamic range of

  • 7/28/2019 16725463

    4/4

    176 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 1, MARCH 1998

    these signals, a floating point representation is used, leading to

    floating point operations in digital signal processors. In [7], a

    floating point multiplier for complex numbers is presented. The

    floating point representation of the complex numbers consist in

    two significants with their own sign bits, and one common

    exponent for both real and imaginary parts. This representation

    proves to be efficient from the point of view of memory

    requirements, and the processing is more efficient since only

    one normalization and rounding process is necessary at the end.After the multiplication both the real and imaginary parts of the

    result can be denormalized, due to cancelation in the additions,

    and a normalization and rounding process is needed. Since the

    real and imaginary parts have a common exponent, the numbers

    are assumed to be normalized to the greater of two, so a two

    operand normalization is necessary. The scheme we propose

    in this work can be used efficiently in this structure reducing

    the hardware requirements for normalization, maintaining the

    speed of the circuit. Since the complex number multiplier is

    a key hardware element in DSP, the proposed circuit for two

    operand normalization can be widely used in this field. Our

    scheme for normalization can also be used in floating point

    addition/subtraction of complex numbers.

    2) Angle Calculation with CORDIC: This is an iterative algo-rithm to perform plane rotations and to evaluate trigonometric

    functions [4], [5]. The basic iteration is based on shift and

    addition operations. Many CORDIC-based algorithms have

    been proposed for signal processing, image processing, matrix

    algebra and robotics [4]. In [5], it has been shown that if the

    input vector is not normalized, then there can be large errors

    in the computation of the inverse tangent function (vectoring

    mode). This situation occurs in the evaluation of the SVD of

    a matrix based on fixed-point CORDIC arithmetic [5]. For

    the computation of the SVD, fixed-point format can be used

    without loss of accuracy. This way the overhead due to floating-

    point processing is avoided. However as the processing is in

    fixed-point format, the processors that evaluate the inverse

    tangent function have to incorporate, internally, a two operandnormalization circuit to avoid errors.

    VI. CONCLUSIONS

    We have presented a novel design for two operand normalization.

    The new architecture is based on the computation of the shift from the

    most significant bit to the least significant one. This technique allows

    significant savings in area because it is not necessary to detect the

    number of leading zeros of the two operands and then to perform a

    comparison as in the conventional design. The main advantage of our

    architecture is the low cost in area as compared to the conventional

    one but maintaining time performance. The proposed circuit can be

    used efficiently in floating point operations over complex numbers

    and CORDIC processors, although other applications are possible.

    ACKNOWLEDGMENT

    The authors thank the referees for their valuable comments.

    REFERENCES

    [1] D. Chevion, E. D. Karnin, and E. Walach, High efficiency mul-tiplication free approximation of arithmetic coding, in Proc. DataCompression Conf., 1991, pp. 4352.

    [2] European Silicon Structures,in ES2 ECPD07 Library Databook, July1993.

    [3] D. Goldberg, What every computer scientist should know aboutfloating-point arithmetic, ACM Computing Surv., vol. 23, no. 1, pp.547, Mar. 1991.

    [4] Y. H. Hu, CORDIC-based VLSI architecture for digital signal process-ing, IEEE Signal Process. Mag., pp. 1635, July 1992.

    [5] K. Kota and J. R. Cavallaro, Numerical accuracy and hardwaretradeoffs for CORDIC arithmetic for special purpose processors, IEEETrans. Comput., vol. 42, no. 7, pp. 769779, July 1993.

    [6] V. G. Oklobdzija, An algorithmic and novel design of a leading zerodetector circuit: Comparison with logic synthesis, IEEE Trans. VLSISyst., vol. 2, pp. 124128, Mar. 1994.

    [7] V. G. Oklobdzija, D. Villeger, and T. Soulas, An integrated multiplierfor complex numbers, J. VLSI Signal Processing, no. 7, pp. 213222,1994.

    [8] H. R. Srinivas and K. Parhi, A fast radix-4 division algorithm and itsarchitecture, IEEE Trans. Comput., vol. 44, no. 6, pp. 826831.

    [9] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design:A System Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993.

    Dynamic Fault Dictionaries and Two-Stage

    Fault Isolation

    Paul G. Ryan and W. Kent Fuchs

    Abstract This paper presents dynamic two-stage fault isolation forsequential random logic very large scale integrated (VLSI) circuits, andintroduces limitedand dynamic fault dictionaries. In the first stage of thedynamic process, a limited fault dictionary identifies candidate faults,which are further distinguished in the second stage by a dictionarygenerated dynamically for the candidate faults and a subset of the testvectors. This provides high resolution but avoids the costs of full staticdictionaries. Two-stage fault isolation is evaluated for benchmark circuitsand on defects in industrial circuits.

    Index TermsCAD, diagnosis, dynamic fault dictionaries, fault isola-

    tion, testing.

    I. INTRODUCTION

    Fault dictionaries have effectively located real defects in industrial

    circuits [1], [2], but their use for diagnosis of large, sequential,

    random logic very large scale integrated (VLSI) circuits has been

    limited by size and cost-related problems. A full dictionary [1][3]

    contains a record for each test vector of the errors a circuits modeled

    faults caused in simulation. Matching algorithms isolate defects by

    comparing errors seen on a tester to those recorded in the dictionary,

    identifying the most likely modeled faults [4]. Typically, this is

    followed by a physical examination to verify and analyze the defect.

    Integrated computer-aided design (CAD) tools have made it possible

    to investigate a set of most likely defects.

    Manuscript received September 15, 1995; revised September 15, 1997. Thiswork was supported in part by the Semiconductor Research Corporation (SRC)under Grant 93-DP-109, by the Joint Services Electronics Program (JSEP)under Grant N00014-90-J-1270, and by the Intel Corporation. This paper waspresented in part at the IEEE Asian Test Symposium, 1993.

    P. G. Ryan is with Intel Corporation SC9-09, Santa Clara, CA 95052 USA.W. K. Fuchs is with the School of Electrical and Computer Engineering,

    Purdue University, West Lafayette, IN 47907 USA.Publisher Item Identifier S 1063-8210(98)01316-X.

    10638210/98$10.00 1998 IEEE