A Low-Complexity Viterbi Decoder


    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 57, NO. 4, APRIL 2010, p. 873

    A Low-Complexity Viterbi Decoder for Space-Time Trellis Codes

    Kai-Ting Shr, Student Member, IEEE, Hong-Du Chen, and Yuan-Hao Huang, Member, IEEE

    Abstract: Space-time trellis code (STTC) has been widely applied to coded multiple-input multiple-output (MIMO) systems because of its gains in coding and diversity; however, its great decoding complexity makes it less promising in chip realization compared to the space-time block code (STBC). The complexity of STTC decoding lies in the branch metric calculation in the Viterbi algorithm and increases significantly along with the number of antennas and the modulation order. Consequently, a low-complexity algorithm to mitigate the computational burden is proposed. The results show that more than 70%, 78%, and 83% of the computational complexity is reduced for 2 × 2, 3 × 3, and 4 × 4 MIMO configurations, respectively. Based on the proposed algorithm, a reconfigurable MISO STTC Viterbi decoder is designed and implemented using 0.18 µm 1P6M CMOS technology. The decoder achieves 11.14 Mbps, 8.36 Mbps, and 5.75 Mbps for 4-PSK, 8-PSK, and 16-QAM modulations, respectively.

    Index Terms: Branch metrics, MIMO, space-time trellis code, Viterbi decoder.

    I. INTRODUCTION

    IN RECENT years, multiple-input multiple-output (MIMO) transmission technology has been widely applied to various wireless communication systems, such as IEEE 802.11n WiFi [1] and IEEE 802.16e WiMAX [2]. MIMO technology is divided into two categories, spatial multiplexing and diversity coding [3]. In the spatial multiplexing technique, the data is split into multiple streams, which are transmitted and received by multiple antennas. Subsequently, the receiver detects the transmitted symbols from the signals received by the multiple receiving antennas. This kind of MIMO technique can increase channel capacity and data rate by using more transmitting antennas [4]-[6]. In addition, the diversity coding technique has a better capability of resisting the channel impairment. The most popular diversity coding technique is space-time coding (STC), which involves space diversity, modulation, and error correction [7]. The STC can moderately improve the spectral efficiency and provide coding gains for error correction [8]. Alamouti [9] proposed a two-transmit-antenna scheme, which was then extended to space-time block codes (STBC) [7] and has been adopted in wireless standards [1], [2] to enhance the performance of wireless communication systems. However, it only improves the diversity gain rather than the channel capacity of the system. The space-time trellis code (STTC) [10]-[12] was also proposed to improve both the diversity gain and coding gain for wireless communication systems. The STTC encoder generates redundant parity check codes which are transmitted with the original information data streams, and thereby coding gain is obtained. Therefore, the STTC technique possesses a more robust capability than the STBC technique for combating severe MIMO channel impairment. However, the STTC decoder employs the Viterbi algorithm, which requires much greater decoding complexity than the STBC decoder [13]; thus, the STTC is seldom adopted as the diversity coding technique in current wireless communication systems.

    Manuscript received January 19, 2009; revised April 30, 2009. First published December 22, 2009; current version published April 09, 2010. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 96-2219-E-007-013 and Grant NSC 96-2220-E-007-014. This paper was recommended by Associate Editor A. Strollo.

    K.-T. Shr is with the Department of Electrical Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

    H.-D. Chen was with the Institute of Communications Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan. He is currently in military service in Taiwan.

    Y.-H. Huang is with the Institute of Communications Engineering and Department of Electrical Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TCSI.2009.2027648

    In the literature, many Viterbi decoder chips have been published for convolutional codes (CC) and trellis-coded modulation (TCM) techniques. A 32-state radix-4 Viterbi decoder was first proposed and implemented in [14], and many succeeding works have proposed methods of improving the performance of the CC Viterbi decoder. Some approaches enhance the throughput performance [15] or reduce the power consumption using specific techniques, such as look-ahead calculation [16] or improved register-exchange architectures [17]. Several CC Viterbi decoders aim to support a larger number of states [18], [19] or different decoding algorithms [20] in certain communication standards. In [21], an adaptive Viterbi decoder can be reconfigured dynamically in response to the varying channel conditions so as to reduce power consumption. On the other hand, Viterbi decoders for the TCM technique, which involves error-correcting coding and modulation, have been designed and implemented in recent years. The TCM Viterbi decoder requires a greater decoding complexity than the CC Viterbi decoder; thus, more effort must be made to reduce hardware costs [22], [23]. An ASIC chip [24] has been proposed to support different communication specifications under a variety of channel conditions. Moreover, TCM Viterbi decoders for WiFi and DVB have been proposed in [25] and [26], respectively.

    Compared to the traditional Viterbi decoder for convolutional codes and TCM codes, the STTC Viterbi decoder has to perform a great number of complex-value multiplications for branch-metric computation, which is proportional to the modulation order and the number of antennas. Efficient STTC decoders with several antennas for the MIMO/MISO systems were proposed in [27], [28]; however, the STTC decoder architecture is seldom addressed or implemented.

    To overcome this implementation bottleneck, an efficient tree-search algorithm and a constant multiplier architecture [29], [30] are proposed in order to reduce the complexity of branch metric calculation in the STTC Viterbi decoder. The complexity can be reduced to such a degree that the cost of implementation is reasonable compared with the available STBC decoder [13]. Then, the STTC Viterbi decoder is implemented based on the proposed algorithm using 0.18 µm 1P6M CMOS technology. To the authors' knowledge, this is the first STTC decoder chip to be proposed and implemented.

    This paper is organized as follows. The proposed STTC decoding algorithm is introduced in Section II. The STTC decoder architecture based on the proposed algorithm is then presented in Section III. The chip implementation and measurement results are demonstrated and discussed in Section IV. Finally, a conclusion to the work is given in Section V.

    1549-8328/$26.00 © 2010 IEEE

    Authorized licensed use limited to: Reva Institute of Tehnology and Management. Downloaded on June 22,2010 at 13:19:26 UTC from IEEE Xplore. Restrictions apply.

    Fig. 1. Space-time trellis-coded MIMO system.

    II. PROPOSED STTC DECODING ALGORITHM

    A. Introduction of Space-Time Trellis Codes

    The space-time trellis-coded MIMO system, as depicted in Fig. 1, involves modulation, error correction, and diversity techniques. At each time instant, a group of data bits is encoded in the transmitter. Then, the STTC-encoded symbols are transmitted by the transmitting antennas. In the receiver, the receiving antennas acquire the signal, which is then decoded by the STTC decoder.

    1) STTC Encoder: For M-PSK or M-QAM modulation, where M denotes the modulation order, the STTC encoder receives one input bit sequence per bit of the modulation symbol, as shown in Fig. 2. Each bit sequence is buffered with delay elements and encoded, for each transmitting antenna, with a set of generator coefficients in a manner similar to the convolutional code. The encoded symbol for each antenna is thus a generator-weighted combination of the current and delayed input bits. The generator coefficients are properly designed using the trace criterion [7] according to the channel properties and are listed in the Appendix.

    2) MIMO Channel Model: In this paper, we assume that the encoded symbols are impaired by the Rayleigh fading channel, and the received signal at time $t$ is modeled by

    $$\mathbf{r}_t = \mathbf{H}_t \mathbf{x}_t + \mathbf{n}_t$$

    where $\mathbf{x}_t$ is the vector of encoded symbols, $\mathbf{H}_t$ is the channel gain matrix, and $\mathbf{n}_t$ is the noise vector consisting of all white Gaussian variables. Each element in $\mathbf{H}_t$ is an independent and identically distributed complex Gaussian random variable with zero mean and unit variance.

    Fig. 2. The STTC encoder.
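The channel model above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function names `rayleigh_channel` and `receive`, the per-dimension noise parameter `noise_std`, and the 2 × 1 example are hypothetical and are not taken from the paper.

```python
import math
import random

def rayleigh_channel(n_rx, n_tx, rng):
    """Draw an n_rx x n_tx matrix of i.i.d. complex Gaussian gains
    with zero mean and unit variance (Rayleigh-distributed magnitude)."""
    s = math.sqrt(0.5)  # real and imaginary parts each carry variance 1/2
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(n_tx)]
            for _ in range(n_rx)]

def receive(H, x, noise_std, rng):
    """Return r = H x + n, where n is complex white Gaussian noise with
    standard deviation noise_std per real dimension (assumed parameter)."""
    r = []
    for row in H:
        faded = sum(h * xi for h, xi in zip(row, x))
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        r.append(faded + noise)
    return r

rng = random.Random(0)
H = rayleigh_channel(n_rx=1, n_tx=2, rng=rng)   # 2 Tx, 1 Rx (MISO)
x = [1, 1j]                                     # one 4-PSK symbol per transmitter
r = receive(H, x, noise_std=0.0, rng=rng)       # noiseless case for checking
```

With `noise_std=0.0` the received sample reduces to the faded symbol alone, which is convenient when checking a decoder against known inputs.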

    3) STTC Decoder: The STTC decoder uses the Viterbi algorithm to detect the coded multiple streams that are transmitted through the MIMO channel. Note that the STTC trellis depends on the modulation order in the state diagram: for M-PSK or M-QAM, there are M transition paths from the previous states. Therefore, the modulation order has a great impact on the computational complexity of the branch metric calculation and the add-compare-select (ACS) computation.

    Branch Metric Calculation: In the STTC Viterbi decoder, the branch metric is defined as the Euclidean distance between the actually received symbol and the faded candidate symbol corresponding to each trellis transition as follows,

    $$\sum_{j=1}^{n_R} \left| r_t^j - \sum_{i=1}^{n_T} h_{j,i}^t \, x_t^i \right|^2 \qquad (1)$$

    where $r_t^j$ is the received symbol in the $j$-th receiving antenna at time $t$; $h_{j,i}^t$ is the channel response between the $j$-th receiving antenna and the $i$-th transmitting antenna at time $t$; $x_t^i$ is the candidate symbol from the $i$-th transmitting antenna at time $t$; and $n_T$ and $n_R$ are the numbers of transmitting and receiving antennas. The branch metric for each trellis path is accumulated and compared to obtain the optimal candidate symbol. It can be seen that the computational complexity of the branch metric calculation increases along with the number of antennas and the modulation order.

    Add-Compare-Select: The add-compare-select block compares the accumulated path metrics and determines the optimal path, which is recorded and updated in the memory. The STTC Viterbi decoder requires M-input ACS operations for M-PSK or M-QAM modulation. The expanding modulation order greatly increases the ACS hardware costs and computation delay time and thus limits the throughput in the Viterbi decoder.

    In summary, the number of transmitting and receiving antennas and the modulation order determine the computational complexity of the STTC decoder. In the following, branch metric calculation methods and a new ACS architecture for reducing the computational complexity are proposed.
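To make the cost of the conventional branch metric concrete, the following sketch evaluates the squared Euclidean distance between a received vector and a faded candidate. The function name `branch_metric`, the list-of-lists channel layout, and the numeric values are assumptions made for illustration only.

```python
def branch_metric(r, H, x):
    """Squared Euclidean distance between the received vector r and the
    faded candidate H x. r has one entry per receive antenna, H[j][i] is the
    gain from transmit antenna i to receive antenna j, x is the candidate."""
    total = 0.0
    for r_j, row in zip(r, H):
        # one complex multiplication per (receive, transmit) antenna pair
        faded = sum(h * x_i for h, x_i in zip(row, x))
        total += abs(r_j - faded) ** 2
    return total

# Hypothetical 2 Tx, 1 Rx example with three candidate symbol pairs.
H = [[1 + 0j, 0.5j]]
r = [1.0 + 0.5j]
candidates = {"(1, 1)": [1, 1], "(1, j)": [1, 1j], "(1, -1)": [1, -1]}
metrics = {name: branch_metric(r, H, x) for name, x in candidates.items()}
best = min(metrics, key=metrics.get)
```

Every candidate costs one complex multiplication per antenna pair, which is why the count grows with both the number of antennas and the modulation order.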

    B. Proposed Branch Metric Calculation Methods

    Two methods for branch metric calculation are proposed. Method I performs the branch metric calculation when the channel matrix is updated for each signal vector in the fast fading channel, and Method II computes the branch metrics if the channel remains fixed during a frame period in the slow fading channel.

    1) Method I: Method I aims to simplify the computation of the branch metrics because a great amount of the complexity lies in the complex-value multiplication of the channel response and the candidate symbol. Method I separates this complex-value multiplication into two real-value equations, one for the real part and one for the imaginary part of the product. Because the real and imaginary parts are fixed for a specific candidate symbol, we can easily calculate the product with simple shift-and-add operations. For example, in the 4-PSK STTC code in the Appendix, each candidate symbol for the two transmitters is drawn from the 4-PSK constellation {1, j, -1, -j}. Its trellis diagram and the I/Q mapping of the faded symbols are depicted in Fig. 3(a) and (b). The faded candidate symbols can then be simplified as listed in Table I. Thus, the complex-value computation is reduced to real-value additions of the real and imaginary parts of the channel responses. Similarly, this method can be applied to 8-PSK and 16-QAM STTC codes, and then only constant multiplication is required in order to compute the faded symbols.
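The simplification in Method I can be sketched for the 4-PSK case as follows. The sketch assumes the constellation points are written as powers of j (index `k` meaning the symbol j**k); this indexing and the function names are hypothetical, and the actual mapping in Table I may differ.

```python
def times_qpsk(h_re, h_im, k):
    """Multiply h = h_re + j*h_im by the 4-PSK symbol j**k (k = 0..3)
    using only swaps and sign changes, with no real multiplications."""
    if k == 0:               # h * 1
        return h_re, h_im
    if k == 1:               # h * j  = -h_im + j*h_re
        return -h_im, h_re
    if k == 2:               # h * -1
        return -h_re, -h_im
    return h_im, -h_re       # h * -j = h_im - j*h_re

def faded_candidate(h1, h2, k1, k2):
    """Faded symbol h1*x1 + h2*x2 for two transmitters, where x = j**k.
    Only real additions are needed once the swaps are done."""
    a_re, a_im = times_qpsk(h1.real, h1.imag, k1)
    b_re, b_im = times_qpsk(h2.real, h2.imag, k2)
    return a_re + b_re, a_im + b_im

h1, h2 = 0.3 - 0.7j, -1.1 + 0.2j   # hypothetical channel responses
node = faded_candidate(h1, h2, 1, 3)
```

For 8-PSK and 16-QAM the constellation coordinates are no longer just 0 and ±1, so the swaps become multiplications by fixed constants, which is where shift-and-add constant multipliers come in.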

    2) Method II: In the slow Rayleigh fading channel, the channel response is constant for a specific time interval; therefore, the faded candidate symbols remain fixed in this time interval, which is defined as a frame in the generator coefficient determination [7].

    At the beginning of each frame, the received symbol and the channel-impaired candidate symbols are mapped onto the I/Q plane. Then, the candidate node nearest to the received symbol is determined in the binary-tree structure, as shown in Fig. 3(c). The perpendicular bisector used to detect the winner of each binary comparison between a pair of candidate nodes can be expressed as follows:

    (2)

    Fig. 3. (a) Trellis diagram. (b) I/Q mapping of the candidate symbols for the 4-PSK modulation. (c) Binary-tree comparison.

    TABLE I: COORDINATES OF NODES FOR 4-PSK

    After the winner is determined, Method II replaces the received symbol in (1) with the nearest candidate node; that is, the distance between the nearest candidate and any other candidate node is taken as the actual branch metric between the received symbol and that candidate node. Once the distances between any two nodes and the bisector equations are precomputed in the first symbol of each frame, only the distance between the nearest candidate and the received symbol needs to be computed for the remaining received symbols in the frame, and thus the computational complexity can be reduced.
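As a worked form of the bisector test described above, assume two candidate nodes at I/Q coordinates $(x_a, y_a)$ and $(x_b, y_b)$ and a received point $(x, y)$. These symbols are chosen here for illustration and are not taken from the paper. Expanding the two squared distances shows that the comparison needs no squaring of the received coordinates:

```latex
\begin{align*}
(x - x_a)^2 + (y - y_a)^2 &< (x - x_b)^2 + (y - y_b)^2 \\
\Longleftrightarrow\quad
2(x_b - x_a)\,x + 2(y_b - y_a)\,y &< (x_b^2 + y_b^2) - (x_a^2 + y_a^2).
\end{align*}
% Equality gives the perpendicular bisector; for y_a \neq y_b it reads
\[
y = -\frac{x_b - x_a}{y_b - y_a}\,x
    + \frac{(x_b^2 + y_b^2) - (x_a^2 + y_a^2)}{2\,(y_b - y_a)} .
\]
```

When the inequality holds, node $a$ wins the comparison. The slope and intercept depend only on the two nodes, so they can be computed once per frame and reused for every received symbol.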

    As an example, Method II is used to compute the branch metrics for the 4-state 4-PSK STTC with 2Tx and 1Rx, whose trellis transition is shown in Fig. 3(a). In the original branch metric calculation, the four branch metrics are calculated using the following equations:

    (3)


    In Method II, the received symbol and the faded candidate symbols are mapped as nodes on the I/Q plane, as shown in Fig. 3(b). The faded candidate symbols are computed in the same way as in Method I. To determine the faded symbol node nearest to the received symbol, the three perpendicular bisectors used in the tree search are derived as follows:

    (4)

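    Then, the faded node nearest to the received symbol can be obtained using the tree-search comparison, and the branch metrics of the three loser nodes are replaced with the precomputed distances between the winner node and those nodes, respectively, as follows.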

    (5)

    The coordinates of the faded candidate symbols, the distances between any two nodes, and all perpendicular bisectors are precomputed at the beginning of a frame and saved in memory until the end of the frame. Therefore, these precomputed distances can be used without the need for additional branch metric calculations for all other loser nodes in the tree-search comparison. Therefore, if the frame is large, a great amount of complexity can be reduced.

    The idea of precomputing slow-changing or common information is popular in digital signal processing algorithms for communication. Typical examples are channel estimation for OFDM systems and preprocessing for the MIMO detector, such as QR decomposition and lattice reduction. These operations are usually performed block-wise, i.e., all computations in these functions are performed once, and afterward their results are utilized for a frame. In the proposed Method II, the branch metric calculation block itself is not slow-changing or common during a frame because it must compute branch metrics for successive input signals. We just try to use the nearest node (fixed for a frame) to replace the received signal (variant for a frame) so that the branch metric computation can be approximated by using the fixed node-distances which are precomputed in the initial symbols of a frame. In the implementation of the algorithm and the hardware architecture, the branch metric calculation block partially generates the results with the precomputed constant node-distances and partially computes the branch metric of the winner node for the successive input signals during a frame. This is the main difference from other similar design concepts in digital signal processing for communication.
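The flow of Method II described above can be sketched in a few functions. This is a minimal software illustration, not the hardware design: the function names, the use of complex numbers for I/Q nodes, and the pairwise tournament used for the tree search are assumptions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points on the I/Q plane."""
    return abs(a - b) ** 2

def precompute_node_distances(nodes):
    """Done once per frame: pairwise squared distances between the
    faded candidate nodes."""
    n = len(nodes)
    return [[dist2(nodes[a], nodes[b]) for b in range(n)] for a in range(n)]

def closer(r, nodes, a, b):
    """Binary comparison against the perpendicular bisector of nodes a, b.
    r is nearer to node a when 2*Re{r*conj(pb - pa)} < |pb|^2 - |pa|^2."""
    pa, pb = nodes[a], nodes[b]
    lhs = 2 * (r * (pb - pa).conjugate()).real
    return a if lhs < abs(pb) ** 2 - abs(pa) ** 2 else b

def tree_search(r, nodes):
    """Pairwise tournament over the candidate nodes; returns the index of
    the node nearest to r."""
    alive = list(range(len(nodes)))
    while len(alive) > 1:
        nxt = [closer(r, nodes, alive[i], alive[i + 1])
               for i in range(0, len(alive) - 1, 2)]
        if len(alive) % 2:
            nxt.append(alive[-1])
        alive = nxt
    return alive[0]

def method2_metrics(r, nodes, node_dist):
    """Approximate branch metrics: the winner's metric is computed from r,
    every loser's metric is read from the precomputed node distances."""
    w = tree_search(r, nodes)
    metrics = list(node_dist[w])        # loser k gets |node_w - node_k|^2
    metrics[w] = dist2(r, nodes[w])     # the only distance computed per symbol
    return w, metrics

# Hypothetical faded candidate nodes and received symbol.
nodes = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
node_dist = precompute_node_distances(nodes)
winner, metrics = method2_metrics(0.9 + 0.8j, nodes, node_dist)
```

The loser metrics are approximations, which is consistent with the coding-gain degradation that the text attributes to Method II.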

    C. Performance Analysis

    The STTC Viterbi decoder using the proposed branch metric calculation methods was simulated and analyzed. The frame size is assumed to be 130 received symbols for slow fading STTC codes. The frame error rate (FER) performances for different modulation schemes are shown in Fig. 4. It can be seen that Method I does not degrade performance because only the simplification of the complex-value multiplication is applied. Moreover, if Rayleigh fading channels are slow, Method II can further reduce the computational complexity at the cost of some degradation in coding gain.

    D. Complexity Analysis

    The computational complexities of the proposed branch metric calculation methods are analyzed for the different MIMO configurations. It is assumed that the channel response remains fixed in a frame, which contains 130 symbols in the slow fading channel and only one symbol in the fast fading channel. Fig. 5 depicts the computational complexities for 16-QAM under different MIMO configurations. Note that Method I is equivalent to the traditional Viterbi algorithm; thus, the total number of operations is not reduced. However, the constant portion represents the simplified multiplications and occupies almost 50% of the total number of computations. Because the constant multiplications can be implemented using shift-and-add operations, cost and power can be greatly reduced in the VLSI implementation. If the fading channel is slow, both methods can remove the majority of constant operations because the faded candidate symbols are computed only once in a single frame. Furthermore, more operations can be removed in Method II because the computations of the branch metrics for the loser nodes are replaced by accessing the precomputed distances in memory. It can be seen that the computational complexities can be reduced by 70%, 78%, and 83% for 2 × 2, 3 × 3, and 4 × 4 MIMO configurations. Note that the computational complexities can be further reduced by more than 90% in the 8 × 8 MIMO using Method II in the slow fading channel.

    The computational complexities of the proposed methods have a close relationship with the modulation order and the frame size, as shown in Fig. 6. For Method I, the metric values are computed for each received symbol. Thus, the computational complexity tends to converge as the frame size increases. For Method II, the precomputation of the distances between any two nodes and the bisector equations requires a large amount of complexity in the initially received symbols in a frame. However, the computational complexities are averaged and become lower as the frame size increases because the later distance results are accessed from the memory without requiring any computation. Therefore, more computation cost can be saved. These analysis results provide the information for determining which method is employed under different circumstances. The results reveal that the frame-size threshold increases along with the increase in the modulation order, while the threshold remains fixed for a specific modulation order under different MIMO configurations. Therefore, the proper branch metric calculation method (Method I or II) can be employed for different frame sizes and modulation orders.

    Fig. 4. Frame error rate for (a) 4-PSK, (b) 8-PSK, and (c) 16-QAM modulations.
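The frame-size trade-off in the complexity analysis can be illustrated with a simple amortization model: a one-time cost per frame plus a cost per symbol. The model, the function names, and every operation count below are hypothetical and are not figures from the paper.

```python
def avg_cost_per_symbol(setup_cost, per_symbol_cost, frame_size):
    """Average operations per symbol when setup_cost is paid once per frame."""
    return (setup_cost + per_symbol_cost * frame_size) / frame_size

def frame_size_threshold(setup_a, per_a, setup_b, per_b, max_frame=10_000):
    """Smallest frame size at which method B becomes cheaper than method A,
    or None if that never happens up to max_frame."""
    for n in range(1, max_frame + 1):
        cost_a = avg_cost_per_symbol(setup_a, per_a, n)
        cost_b = avg_cost_per_symbol(setup_b, per_b, n)
        if cost_b < cost_a:
            return n
    return None

# Hypothetical counts: method A has a small setup and a high per-symbol cost,
# method B has a large precomputation and a low per-symbol cost.
threshold = frame_size_threshold(setup_a=16, per_a=32, setup_b=200, per_b=8)
```

Under this model a larger precomputation, as a higher modulation order would require, moves the threshold to a longer frame, which matches the trend the text reports.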

    Fig. 5. Computational complexity of 16-QAM modulations.

    Fig. 6. Complexity versus frame size for a 4 × 4 MIMO.

    Fig. 7. Block diagram for the STTC decoder.

    Fig. 8. Branch metric generator architecture.

    III. ARCHITECTURE AND CIRCUIT DESIGN

    The STTC Viterbi decoder, like the convolutional code Viterbi decoder, consists of three basic blocks: the branch metric generator, the add-compare selector, and the path&metric updater for final decoding, as shown in Fig. 7. The decoder supports 4-PSK/8-PSK/16-QAM modulations with 4/8/16, 8/16, and 16 states, respectively, and provides two operation modes, Mode I and Mode II, for the respective methods, Method I and Method II, in the proposed branch metric calculation.

    Fig. 9. Faded symbol generator.

    A. Branch Metric Generator

    The branch metric generator (BMG), as shown in Fig. 8, is composed of the faded symbol generators (FSGs), precomputation and decision blocks (PDBs), and branch metric calculators (BMCs). In Mode I, the received symbol and the faded candidate symbols are used to calculate the branch metrics. In Mode II, the PDBs compute the bisector equations between any two nodes of the faded candidate symbols and detect the winner node for the branch metric calculation in Method II. Then, the BMCs use the faded symbols to compute the branch metrics.

    1) Faded Symbol Generators: Four faded symbol generators are used in the BMG, and each FSG computes the faded candidate symbols using shift-and-add operations, as shown in Fig. 9. Then, the faded symbols are stored in memory. Because, at most, sixteen states are used in the STTC specification, sixteen sets of real and imaginary parts are required for one FSG. In Mode I, these faded symbols are directly delivered to the branch metric calculator, while, in Mode II, they are used to precompute the distances and the bisectors. Six bits are used for the received signal and channel response, and the memory in the four FSGs holds 4 × 16 sets of real and imaginary parts in total. Because four candidate symbols are computed or read by the four FSGs in each cycle, the faded symbol computation can be carried out in one cycle for the 4-PSK modulation, two cycles for the 8-PSK modulation, and four cycles for the 16-QAM modulation.
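The cycle counts quoted above follow from dividing the number of candidate symbols by the number of parallel FSGs. The sketch below assumes that one candidate per FSG is produced per cycle and that the number of candidates equals the modulation order; the names are hypothetical.

```python
import math

NUM_FSG = 4  # four faded symbol generators operate in parallel

def fsg_cycles(modulation_order):
    """Cycles needed to produce all faded candidate symbols when NUM_FSG
    candidates are computed or read per cycle."""
    return math.ceil(modulation_order / NUM_FSG)

cycles = {name: fsg_cycles(order)
          for name, order in (("4-PSK", 4), ("8-PSK", 8), ("16-QAM", 16))}
```

The result reproduces the one, two, and four cycles stated in the text for 4-PSK, 8-PSK, and 16-QAM.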

    2) Pre-Computation and Decision Block: In the initial symbols of a frame, the PDBs compute the necessary bisector equations for binary-tree comparison and then store these equations in memory. If the required bisector equations have been stored in memory, the results are just read without requiring any computation. Therefore, the PDBs do not compute any bisector equation once all of the bisectors have been stored. The slope, intercept, and scaled constant can be computed for each perpendicular bisector as follows:

    (6)

    and stored in the PDB memory if the bisector has never been calculated. Two multipliers are shared among these computations, and thus the PDB requires one additional cycle at the beginning of the frame, as shown in Fig. 10. However, a quantity that has already been computed and stored in memory can be read directly, and therefore only the remaining quantities have to be computed. The PDB needs only one cycle once all the bisectors are computed and stored. Six PDBs are utilized in the BMG, and each PDB contains a memory block sized for, at most, sixteen states. Therefore, six bisectors can be generated for binary-tree comparison in one cycle.

    Fig. 10. Pre-computation and decision block.

    3) Branch Metric Calculator: In Mode I, four BMCs directly compute the Euclidean distances, as shown in Fig. 11. In Mode II, the BMC computes the branch metric and stores it in memory only when the required branch metric has not been previously computed. Once all the branch metrics have been computed and stored in memory, the branch metrics are simply read from memory, and more computation power can be saved accordingly.

    4) Computation Timing Schedule: The BMG is composed of four FSGs, six PDBs, and four BMCs. Fig. 12 shows the Mode II BMG timing schedules for the 4-PSK and 8-PSK modulations. For the 4-PSK modulation, the BMG generates the metrics for one received symbol in four cycles at the beginning of a frame, and in three cycles for the following symbols. Note that since six PDBs compute all the bisector equations in the first symbol, the remaining symbols in the frame require three cycles for a single symbol. On the other hand, the PDBs only compute the bisectors necessary for binary-tree comparison in one received symbol and store them in memory. Thus, four cycles are required at the beginning of a frame for a single symbol. Once all the bisectors are saved in memory, three cycles are required for the remaining symbols. For the 8-PSK modulation, as depicted in Fig. 12(b), the initial received symbol requires nine cycles to complete the branch metric calculation, while the remaining symbols require seven cycles. In order to process the symbols for the 16-QAM modulation, seventeen cycles are required at the beginning, while thirteen cycles are required for the remaining symbols.
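The schedule above implies a per-frame cycle budget for the BMG in Mode II. The sketch below tabulates the quoted cycle counts and averages them over a frame. It assumes that only the first symbol of a frame pays the longer count, and it covers the BMG alone, not the ACS or trace-back stages; the names are hypothetical.

```python
# (cycles for the first symbol of a frame, cycles for each remaining symbol),
# as quoted in the text for the Mode II branch metric generator
BMG_CYCLES = {"4-PSK": (4, 3), "8-PSK": (9, 7), "16-QAM": (17, 13)}

def bmg_cycles_per_frame(modulation, frame_size):
    """Total BMG cycles for one frame of frame_size symbols."""
    first, rest = BMG_CYCLES[modulation]
    return first + rest * (frame_size - 1)

def avg_bmg_cycles(modulation, frame_size):
    """Average BMG cycles per symbol over one frame."""
    return bmg_cycles_per_frame(modulation, frame_size) / frame_size

totals = {m: bmg_cycles_per_frame(m, 130) for m in BMG_CYCLES}
```

For the 130-symbol frame used in the paper's simulations, the one-time overhead is amortized so that the average stays close to the steady-state count.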

    B. Add-Compare Selector

    In the ACS block, branch metrics are accumulated in the path metrics for determining the decoding path in the trellis diagram, as shown in Fig. 13. Because the 4-input binary comparator requires a longer computation time and a higher cost, a simple 4-input minimum comparator architecture is proposed, as shown in Fig. 14. The n-bit comparison unit in this comparator encodes the branch metrics using an array of bitwise-ORs, as shown in Fig. 15. The ones counter then counts the number of ones in the sequence. If this count is less than eight, it determines which bit of each input sequence is chosen to form the output sequence. On the other hand, if the count is equal to eight, the sixth bit of each metric is chosen as the output sequence. The 4-bit sequence is compared with Table II to determine the minimum metric. To demonstrate the advantages of the 4-input minimum comparator, the proposed and tree-based comparators were designed and simulated using 0.18 µm CMOS technology. The proposed comparator occupies 1806.23 µm² with a critical path delay of 4.75 ns, while the tree-based comparator occupies 2837.42 µm² with a critical path delay of 8.79 ns. The proposed minimum comparator may cause errors in some cases; however, these errors have a negligible effect on the decoder performance, as shown in the following fixed-point simulation results.

    Moreover, a simple but reliable normalization architecture is proposed in order to prevent data overflow, as shown in Fig. 16. When all the path metrics are larger than 11'b00001000000, this constant is subtracted from each branch metric.

    Fig. 17 shows the ACS operation schedules for the 4-PSK and 8-PSK modulations. Comparison for the 4-PSK modulation can be performed in one cycle using the 4-input minimum comparator. However, an additional 2-input comparison is required for the 8-PSK modulation in an additional cycle, as shown in Fig. 17. Thus, three clock cycles are required in order to determine the minimum branch metric for each state in the 8-PSK modulation, and five cycles are required to perform five 4-input comparisons for the 16-QAM modulation.

    Fig. 11. The branch metric calculator.

    Fig. 12. BMG timing schedules for (a) 4-PSK and (b) 8-PSK.

    Fig. 13. ACS block.
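The add-compare-select step and the overflow normalization can be sketched in software as follows. This sketch uses an exact minimum rather than the approximate 4-input comparator of the paper, and it applies the subtraction to the stored path metrics, which is an assumption about where the normalization acts. The function names are hypothetical.

```python
NORM_CONST = 0b00001000000  # 64, the 11-bit constant quoted in the text

def acs(prev_path_metrics, branch_metrics):
    """Add-compare-select for one state.

    prev_path_metrics[k] and branch_metrics[k] belong to the k-th incoming
    branch. Returns (surviving path metric, index of the surviving branch).
    """
    sums = [p + b for p, b in zip(prev_path_metrics, branch_metrics)]
    best = min(range(len(sums)), key=sums.__getitem__)  # exact minimum
    return sums[best], best

def normalize(path_metrics):
    """Subtract NORM_CONST from every metric once all of them exceed it,
    which keeps the metrics bounded without changing their differences."""
    if all(m > NORM_CONST for m in path_metrics):
        return [m - NORM_CONST for m in path_metrics]
    return list(path_metrics)

survivor = acs([10, 3, 7, 9], [1, 5, 2, 0])
scaled = normalize([70, 65, 100, 90])
```

Because the same constant is removed from every metric, later comparisons between paths are unaffected by the normalization.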

    C. Path&Metric Updater

    After the minimum metric and its path indexes are determined, they are stored in memory, and the final results are decoded using the trace-back algorithm [31], [32]. The path&metric updater is shown in Fig. 18. Eight 4-to-1 multiplexers perform the trace-back for the 4-state trellis, while four 8-to-1 multiplexers and two 16-to-1 multiplexers are used for the 8-state and 16-state trellises, respectively. In the proposed design, the fixed-state trace-back algorithm, in which state 0 is regarded as the starting point for tracing back in the decoding window, is chosen. Thus, the hardware costs can be reduced significantly, while the performance degradation between the fixed-state and soft-state trace-back algorithms is negligible.
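The fixed-state trace-back described above can be sketched as a walk through the stored survivor decisions, always starting from state 0. The survivor layout (a list per time step of `(previous_state, decoded_symbol)` pairs) and the example window are hypothetical and chosen only to show the mechanics.

```python
def trace_back(survivors, start_state=0):
    """Fixed-state trace-back.

    survivors[t][s] = (previous_state, decoded_symbol) for the surviving
    branch entering state s at step t. Tracing starts from start_state at
    the end of the window; decoded symbols are returned in forward order.
    """
    state = start_state
    decoded = []
    for step in reversed(survivors):
        prev_state, symbol = step[state]
        decoded.append(symbol)
        state = prev_state
    decoded.reverse()
    return decoded

# Hypothetical 4-state window of three steps.
survivors = [
    [(0, 0), (0, 1), (0, 2), (0, 3)],
    [(2, 0), (1, 1), (3, 2), (0, 3)],
    [(1, 0), (2, 1), (2, 2), (3, 3)],
]
decoded = trace_back(survivors)
```

Starting from a fixed state avoids searching for the state with the best path metric at the end of the window, which is the hardware saving the text refers to.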


    Fig. 14. 4-input comparator.

    Fig. 15. n-bit comparison unit.

    TABLE II: LOOK-UP THE MINIMUM DECISION BLOCK

    D. Fixed-Point Simulation Results

    In this work, efforts were dedicated to reducing the complexities of branch metric calculation, which implies that the performance is sensitive to the precision of branch metrics. Therefore, fixed-point simulations are performed before the chip implementation in order to ensure that the degradation is slight.

    Fig. 16. Normalization block.

    Fig. 17. ACS block timing schedules for 4-PSK and 8-PSK.

    Fig. 19 shows that the performance of the fixed-point Viterbi

    decoder using a fixed-state trace-back algorithm approximates

    that of the floating-point Viterbi decoder using a soft-state

    trace-back algorithm. The word-length of each signal was

    determined through fixed-point simulations according to the

    frame error rate (FER) metric. The decoding window size was

    also determined through fixed-point simulations, as shown in

    Fig. 20. The results show that the minimum truncation window

    size for 4-PSK is about the constraint length, and

    the window size for 8-PSK and 16-QAM is the

    constraint length. The decreasing window size with the increase

    in modulation order is mainly due to the fact that the higherdiversity in the STTC trellis causes a faster convergence in the


    Fig. 18. Path&metric updater for the m-bit modulation.

    Fig. 19. Fixed-point simulation results for the proposed STTC decoder.

    Fig. 20. Simulation of different truncation window sizes for the proposed STTC decoder.

    decoding path. Therefore, 4 × 16 × 32 bits of memory are used
    to store 4 branch paths, 16 states, and a truncation window size of 32.
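
    The memory size quoted above is a plain product of three dimensions. The short sketch below spells out the arithmetic; the function and parameter names are hypothetical, and the window dimension is taken as 32 to match the stated product.

```python
def survivor_memory_bits(branch_paths: int, num_states: int, window: int) -> int:
    """Total storage in bits as the product of the three dimensions."""
    return branch_paths * num_states * window


bits = survivor_memory_bits(branch_paths=4, num_states=16, window=32)
bytes_needed = bits // 8
```

    With these dimensions the product is 2048 bits, i.e., 256 bytes.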

    Fig. 21. Photo of the STTC decoder chip.

    TABLE III CHIP SPECIFICATION FOR THE STTC DECODER

    TABLE IV POWER CONSUMPTIONS FOR DIFFERENT MODULATIONS AND STATE-SIZE UNDER MODE I AND MODE II

    IV. CHIP IMPLEMENTATION

    The proposed STTC decoder was fabricated using TSMC 0.18-μm one-poly
    six-metal CMOS technology (see Fig. 21). The chip occupies about
    4.35 mm² with a core area of 1.62 mm² and includes a range of memory
    capacities as listed in Table III.


    TABLE V COMPARISON TABLE

    The STTC decoder chip supports 4-PSK, 8-PSK, and 16-QAM modulations
    and 2 × 1, 3 × 1, and 4 × 1 antenna configurations. For each modulation
    type, 128 frames of test patterns are generated, and each frame
    comprises 130 data symbols impaired by the fading channel and AWGN
    defined in Section II-A2. These patterns are then used to test the STTC
    decoder chip in Modes I and II via the Agilent 93000 digital test
    station. The detailed measurement results are listed in Table III.

    The clock rate of the chip reaches 22.28 MHz. The throughput ranges
    from 2.785 Mbps to 11.14 Mbps for the 4-PSK modulation and from
    4.18 Mbps to 8.36 Mbps for the 8-PSK modulation, and it is 5.57 Mbps
    for the 16-QAM modulation. The power consumption ranges from 0.43 mW to
    2.45 mW for different numbers of states and modulation orders, as
    listed in Table IV. The power consumption increases along with the
    modulation order and the number of states. When the modulation order is
    higher than 8-PSK, or the number of states is larger than sixteen, the
    chip reaches its upper power consumption limits of about 2.4 mW for
    Mode I and about 2.0 mW for Mode II because the branch metric generator
    and path&metric updater were designed for the 8-PSK and 16-state
    scheme. Although the STTC Viterbi chip was designed only for the 8-PSK
    and 16-state scheme because of the rapid growth in the numbers of FSG,
    PDB, and BMC blocks for 16-QAM modulation, the chip can still perform
    16-QAM STTC decoding at the cost of increased processing latency.
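
    The reported throughputs can be related to the 22.28-MHz clock by simple arithmetic. The sketch below only divides the figures quoted above; reading the ratio as decoded bits per clock cycle is an interpretation added here, not a figure stated in the text, and the dictionary labels are hypothetical.

```python
CLOCK_MHZ = 22.28  # measured clock rate quoted in the text

# Throughputs in Mbps quoted in the text; the labels are illustrative.
reported = {
    "4-PSK max": 11.14,
    "4-PSK min": 2.785,
    "8-PSK max": 8.36,
    "8-PSK min": 4.18,
    "16-QAM": 5.57,
}

# Ratio of throughput to clock rate, i.e., bits delivered per clock cycle.
bits_per_cycle = {name: mbps / CLOCK_MHZ for name, mbps in reported.items()}

for name, ratio in bits_per_cycle.items():
    print(f"{name}: {ratio:.3f} bits/cycle")
```

    The ratios come out at 0.5, 0.125, about 0.375, about 0.188, and 0.25 bits per cycle, respectively, so every reported throughput is a simple fraction of the clock rate.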

    Note that Mode II (for Method II) reduces the power consumption of Mode I (for Method I) by , and the

    cost overhead for Mode II lies in the PDBs that account for

    only 9.38% of the total cell area. This large power reduction arises
    because, in Mode II for the slow fading channel, the branch metrics are
    computed using the PDBs and stored in memory. Once most of the branch
    metrics are prestored in memory, the high branch metric computation
    power is replaced by the lower memory access power.
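
    The idea of trading computation for memory access in a slow fading channel can be sketched as follows. This is a simplified illustration under assumptions: it caches the channel-weighted branch outputs for a MISO link while the channel is constant, which is not necessarily the same quantity that the PDBs store on the chip, and all names (`build_table`, `branch_metric`, the example channel values) are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Symbols = Tuple[complex, ...]


def branch_output(h: Sequence[complex], x: Symbols) -> complex:
    """Noiseless received value for transmit symbols x over channel h (MISO)."""
    return sum(hi * xi for hi, xi in zip(h, x))


def build_table(h: Sequence[complex], alphabet: Sequence[complex]) -> Dict[Symbols, complex]:
    """Precompute all branch outputs once; valid while h stays constant."""
    return {x: branch_output(h, x) for x in product(alphabet, repeat=len(h))}


def branch_metric(r: complex, expected: complex) -> float:
    """Squared Euclidean distance between received and expected values."""
    return abs(r - expected) ** 2


# Hypothetical 2 x 1 example with a 4-PSK alphabet and a fixed channel.
qpsk = (1 + 0j, 1j, -1 + 0j, -1j)
h = (0.5 + 0.5j, 1 - 0.25j)
table = build_table(h, qpsk)  # computed once per frame
r = 1.5 + 0.25j               # one received sample (noise-free here)
metrics = {x: branch_metric(r, y) for x, y in table.items()}  # table look-ups only
best = min(metrics, key=metrics.get)
```

    In a fast fading channel the table would have to be rebuilt for every symbol, so the caching would bring no benefit; this matches the split between the two modes described in the text.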

    Although no STTC Viterbi decoder has previously been reported in the
    literature, the chip is compared with other state-of-the-art designs,
    such as the CC Viterbi decoders [15], [18], [19] and TCM Viterbi
    decoders [22]–[24], as listed in Table V. Because Viterbi decoder
    performance is closely related to the computational complexity of the
    decoder's trellis diagram, a trellis work-load factor (TWLF) is defined
    to indicate the computational complexity per trellis stage as follows:

    where is the arithmetic computational cost for branch metric

    computation; is the number of modulated bits; is the state

    number; and is the antenna dimensions. and

    both equal 1 for the CC and TCM decoders.

    The Viterbi decoders in [22] and [23] both use the Hamming distance to
    calculate the branch metrics; thus, their arithmetic computational cost
    is only one XOR gate count. The cost for the CC Viterbi decoders in
    [15], [18], and [19] represents the gate count of one subtractor and
    one squarer, which are used to compute the Euclidean distances with
    different word-lengths in a single dimension. The TCM Viterbi decoder
    in [24] employs the simplified Euclidean distance to determine the
    branch metrics in two dimensions, and thus its cost is the gate count
    of two subtractors without any squarer. The cost for the proposed STTC
    Viterbi decoder corresponds to the complex-valued channel-fading
    multiplications and Euclidean distance computation [see (1)]. Gate
    counts for the XOR, subtractor, squarer, and complex-valued multiplier
    are estimated by synthesizing these operators using a standard cell
    library. The cost values in Table V are normalized to that of the
    design in [23], which is one XOR gate.
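
    The difference in arithmetic cost between these metric types can be seen from software analogues of the three computations. The functions below are illustrative sketches with hypothetical names; they mirror the operations named in the text (XOR, subtractor plus squarer, complex multiplications plus distance) and say nothing about actual gate counts. The simplified metric of [24] is omitted because its exact form is not given here.

```python
from typing import Sequence


def hamming_metric(received_bits: int, branch_bits: int) -> int:
    """Hamming distance: XOR the words and count the differing bits."""
    return bin(received_bits ^ branch_bits).count("1")


def euclidean_metric_1d(r: float, s: float) -> float:
    """One subtraction followed by one squaring in a single dimension."""
    return (r - s) ** 2


def sttc_metric(r: complex, h: Sequence[complex], x: Sequence[complex]) -> float:
    """Complex channel multiplications followed by a squared distance."""
    expected = sum(hi * xi for hi, xi in zip(h, x))
    return abs(r - expected) ** 2


d_hamming = hamming_metric(0b1011, 0b0010)
d_euclid = euclidean_metric_1d(0.5, -1.0)
d_sttc = sttc_metric(1 + 1j, [1 + 0j, 1j], [1 + 0j, 1 + 0j])
```

    Only the last function needs one complex multiplication per transmit antenna, which is why its cost grows with the antenna dimension.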

    Note that, although the proposed chip has lower throughput than the
    other decoders, the TWLF for the STTC Viterbi decoder is much larger
    than those of the other Viterbi decoders because the STTC Viterbi
    decoder must perform a much more complex branch-metric computation,
    which grows with the antenna dimension and modulation order. To compare
    the processing capabilities of Viterbi decoders for different types of
    trellis codes, we normalize the chip throughput by multiplying it by
    the TWLF. The normalized throughput, shown in brackets in Table V, can
    be regarded as the trellis processing capability per second of the
    Viterbi decoder. If we further consider the power consumption, the
    power efficiency (throughput per mW) of the proposed STTC Viterbi
    decoder outperforms that of most other Viterbi decoders except for
    [24]. Note that these comparison values still depend on the


    TABLE VI GENERATOR COEFFICIENT SETS OF THE STTC ENCODER FOR THE SLOW FADING CHANNEL

    fabrication technology. If the proposed decoder is implemented

    using an advanced technology, its (normalized) throughput and

    power efficiency can be further improved.
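
    The two comparison figures used above are simple ratios and products, sketched below. The function names are hypothetical, and the TWLF value in the example is a placeholder rather than an entry from Table V; the throughput and power pair (11.14 Mbps, 0.43 mW) is the one quoted in the concluding section.

```python
def normalized_throughput(throughput_mbps: float, twlf: float) -> float:
    """Throughput weighted by the trellis work-load factor."""
    return throughput_mbps * twlf


def power_efficiency(throughput_mbps: float, power_mw: float) -> float:
    """Throughput delivered per milliwatt (Mbps/mW)."""
    return throughput_mbps / power_mw


PLACEHOLDER_TWLF = 50.0  # hypothetical value, not taken from Table V

norm = normalized_throughput(11.14, PLACEHOLDER_TWLF)
eff = power_efficiency(11.14, 0.43)
```

    Both figures scale linearly with throughput, so a faster process technology would raise them together, as the text notes.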

    V. CONCLUSION

    In this paper, a low-complexity STTC decoding algorithm and its
    associated hardware architecture are proposed. Method I (Mode I) is
    designed for fast fading channels and Method II (Mode II) is designed
    for slow fading channels. The complexity analysis provides the
    information needed to select the proper method (mode) under different
    configurations. The STTC decoder is implemented using 0.18-μm 1P6M
    CMOS technology and supports 4 × 1, 3 × 1, and 2 × 1 configurations for
    4/8/16-state 4-PSK, 8/16-state 8-PSK, and 16-state 16-QAM STTC
    schemes. Moreover, two modes are offered

    TABLE VII GENERATOR COEFFICIENT SETS OF THE STTC ENCODER FOR THE FAST FADING CHANNEL

    for decoding the received symbols under different fading

    channel conditions. This chip yields a maximum throughput of

    11.14 Mbps at a power consumption of 0.43 mW. In conclusion, an STTC decoder was realized in a silicon chip, which

    the authors believe can improve the reliability of future coded

    MIMO communication systems.

    APPENDIX

    GENERATOR COEFFICIENT SETS

    See Tables VI and VII.

    ACKNOWLEDGMENT

    The authors would like to thank the Chip Implementation Center (CIC)
    of the National Science Council in Taiwan for technical support. They
    also thank the anonymous reviewers for their valuable suggestions that
    greatly improved this paper.

    REFERENCES

    [1] Information Technology-Telecommunications and Information Exchange Between Systems-Local and Metropolitan Area Networks-Specific Requirements-Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment 4: Enhancements for Higher Throughput, 2008, IEEE Unapproved Draft Standard 802.11n, 4.00.

    [2] Local and Metropolitan Area Networks Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems Amendment 2: Physical and Medium Access Control Layers for Combined Fixed and Mobile Operation in Licensed Bands and Corrigendum 1, 2006, IEEE Standard 802.16.

    [3] J. Winters, J. Salz, and R. D. Gitlin, "The impact of antenna diversity on the capacity of wireless communication systems," IEEE Transactions on Communications, vol. 42, pp. 1740–1751, Feb. 1994.


    [4] Y. Jung, J. Kim, S. Noh, H. Yoon, and J. Kim, "A digital 120 Mb/s MIMO-OFDM baseband processor for high speed wireless LANs," in Proc. IEEE CICC'05, Sep. 2005, pp. 81–84.

    [5] T. Chen, Z. Yu, Y. Peng, Y. Zhang, H. Dai, and X. Liu, "A MIMO receiver SOC for CDMA applications," in Proc. IEEE International SOC Conference, Sep. 2006, pp. 275–278.

    [6] Y. Jung, J. Kim, S. Lee, H. Yoon, and J. Kim, "Design and implementation of MIMO-OFDM baseband processor for high-speed wireless LANs," IEEE Transactions on Circuits and Systems, Part II: Express Briefs, vol. 54, pp. 631–635, Jul. 2007.

    [7] B. Vucetic and J. Yuan, Space-Time Coding. John Wiley and Sons, 2003.

    [8] D. Bevan and R. Tanner, "Performance comparison of space-time coding techniques," IEE Electronics Letters, vol. 35, pp. 1707–1708, Sep. 1999.

    [9] S. M. Alamouti, "A simple transmit diversity technique for wireless communications," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 1451–1458, Oct. 1998.

    [10] V. Tarokh, N. Seshadri, and A. R. Calderbank, "Space-time codes for high data rate wireless communication: Performance criterion and code construction," IEEE Transactions on Information Theory, vol. 44, pp. 744–765, Mar. 1998.

    [11] A. Naguib, V. Tarokh, N. Seshadri, and A. Calderbank, "A space-time coding modem for high-data-rate wireless communications," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 1459–1473, Oct. 1998.

    [12] V. Tarokh, A. Naguib, N. Seshadri, and A. R. Calderbank, "Combined array processing and space-time coding," IEEE Transactions on Information Theory, vol. 45, pp. 1121–1128, May 1999.

    [13] E. Cavus and B. Daneshrad, "A very low-complexity space-time block decoder (STBD) ASIC for wireless systems," IEEE Transactions on Circuits and Systems, Part I: Regular Papers, vol. 53, pp. 60–69, Jan. 2006.

    [14] P. J. Black and T. H. Meng, "A 140-Mb/s, 32-state, radix-4 Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 27, pp. 1877–1885, Dec. 1992.

    [15] E. Yeo, S. A. Augsburger, W. R. Davis, and B. Nikolic, "A 500-Mb/s soft-output Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 38, pp. 1234–1241, Jul. 2003.

    [16] C. Cheng and K. K. Parhi, "Hardware efficient low-latency architecture for high throughput rate Viterbi decoders," IEEE Transactions on Circuits and Systems, Part II: Express Briefs, vol. 55, pp. 1254–1258, Dec. 2008.

    [17] M. D. Shieh, T. P. Wang, and D. W. Yang, "Low-power register-exchange survivor memory architectures for Viterbi decoders," IET Circuits, Devices, & Systems, vol. 3, pp. 83–90, Apr. 2009.

    [18] F. Sun and T. Zhang, "Parallel high-throughput limited search trellis decoder VLSI design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 1013–1022, Sep. 2005.

    [19] M. A. Anders, S. K. Mathew, S. K. Hsu, R. K. Krishnamurthy, and S. Borkar, "A 1.9 Gb/s 358 mW 16–256 state reconfigurable Viterbi accelerator in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, pp. 214–222, Jan. 2008.

    [20] C. C. Lin, Y. H. Shih, H. C. Chang, and C. Y. Lee, "A low power turbo/Viterbi decoder for 3GPP2 applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, pp. 426–430, Apr. 2006.

    [21] R. Tessier, S. Swaminathan, R. Ramaswamy, D. Goeckel, and W. Burleson, "A reconfigurable, power-efficient adaptive Viterbi decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 484–488, Apr. 2005.

    [22] E. F. Haratsch and K. Azadet, "A low complexity joint equalizer and decoder for 1000Base-T Gigabit Ethernet," in Proc. IEEE CICC'00, May 2000, pp. 465–468.

    [23] A. Dinh and X. Hu, "A hardware-efficient technique to implement a trellis code modulation decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 745–750, Jun. 2005.

    [24] M. Kamuf, V. Owall, and J. B. Anderson, "Optimization and implementation of a Viterbi decoder under flexibility constraints," IEEE Transactions on Circuits and Systems, Part I: Fundamental Theory and Applications, vol. 55, no. 8, pp. 2411–2422, Sep. 2008.

    [25] S. Nandula, Y. S. Rao, and S. P. Embanath, "High speed area efficient configurable Viterbi decoder for WiFi and WiMAX systems," in Proc. IEEE ICIAS'07, Nov. 2007, pp. 1396–1399.

    [26] R. Manzoor, A. Rafique, and K. B. Bajwa, "Hardware implementation of pragmatic trellis coded modulation applied to 8PSK and 16QAM for DVB standard," in Proc. IEEE CNSDSP'08, Jul. 2008, pp. 363–367.

    [27] H. Lee and M. P. Fitz, "Systematic expansion of full diversity space-time multiple TCM codes for two transmit antennas," IEEE Transactions on Wireless Communications, vol. 7, pp. 2027–2032, Jun. 2008.

    [28] T. M. H. Ngo, G. Zaharia, S. Bougeard, and J. F. Helard, "Design of balanced QPSK space-time trellis codes for several transmit antennas," in Proc. IEEE ISSCS'07, Jul. 2007, vol. 2, pp. 1–4.

    [29] D. Kim and H.-W. Choi, "Advanced constant multiplier for multipath pipelined FFT processor," IET Electronics Letters, vol. 44, pp. 518–519, Apr. 2008.

    [30] H. T. Nguyen and A. Chatterjee, "Number-splitting with shift-and-add decomposition for power and hardware optimization in linear DSP synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 419–424, Aug. 2000.

    [31] O. Collins and F. Pollara, "Memory management in traceback Viterbi decoders," Jet Propulsion Laboratory, 1988, TDA Progress Report 42-99.

    [32] T. K. Truong, M. T. Shih, I. S. Reed, and E. H. Satorius, "A VLSI design for a trace-back Viterbi decoder," IEEE Transactions on Communications, vol. 40, pp. 616–624, Mar. 1992.

    Kai-Ting Shr (S'08) was born in Taiwan in 1983. He received the B.S. degree in electrical engineering from National Tsing-Hua University (NTHU), Hsinchu, Taiwan, in 2005, where he is currently working toward the Ph.D. degree.

    His research interests include VLSI design and implementation of low-power and high-throughput communication systems.

    Hong-Du Chen was born in Taiwan in 1982. He received the B.S. degree in electronic engineering from Chang-Gung University, Taoyuan, Taiwan, in 2004 and the M.S. degree in communications engineering from National Tsing-Hua University, Hsinchu, Taiwan, in 2007.

    He is currently in military service in Taiwan. His research interests include VLSI design and implementation of communication applications.

    Yuan-Hao Huang (S'98–M'02) was born in Taiwan in 1973. He received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1995 and 2001, respectively.

    He was a Member of Technical Staff with VXIS Technology Corporation, Hsin-Chu, Taiwan, from 2001 to 2005. Since 2005, he has been with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing-Hua University, Taiwan, where he is currently an Assistant Professor. His research interests include VLSI design for digital signal processing systems and telecommunication systems.