A Low-Complexity Viterbi Decoder


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 57, NO. 4, APRIL 2010 873

A Low-Complexity Viterbi Decoder for Space-Time Trellis Codes

Kai-Ting Shr, Student Member, IEEE, Hong-Du Chen, and Yuan-Hao Huang, Member, IEEE

Abstract: Space-time trellis code (STTC) has been widely applied to coded multiple-input multiple-output (MIMO) systems because of its gains in coding and diversity; however, its great decoding complexity makes it less promising in chip realization compared to the space-time block code (STBC). The complexity of STTC decoding lies in the branch metric calculation in the Viterbi algorithm and increases significantly along with the number of antennas and the modulation order. Consequently, a low-complexity algorithm to mitigate the computational burden is proposed. The results show that more than 70%, 78%, and 83% of the computational complexity is reduced for 2 × 2, 3 × 3, and 4 × 4 MIMO configurations, respectively. Based on the proposed algorithm, a reconfigurable MISO STTC Viterbi decoder is designed and implemented using 0.18 μm 1P6M CMOS technology. The decoder achieves 11.14 Mbps, 8.36 Mbps, and 5.75 Mbps for 4-PSK, 8-PSK, and 16-QAM modulations, respectively.

Index Terms: Branch metrics, MIMO, space-time trellis code, Viterbi decoder.

I. INTRODUCTION

In recent years, multiple-input multiple-output (MIMO) transmission technology has been widely applied to various wireless communication systems, such as IEEE 802.11n WiFi [1] and IEEE 802.16e WiMAX [2]. MIMO technology is divided into two categories, spatial multiplexing and diversity coding [3]. In the spatial multiplexing technique, the data is split into multiple streams, which are transmitted and received by multiple antennas. Subsequently, the receiver detects the transmitted symbols from the signals received by the multiple receiving antennas. This kind of MIMO technique can increase channel capacity and data rate by using more transmitting antennas [4]-[6]. In contrast, the diversity coding technique has a better capability of resisting channel impairment. The most popular diversity coding technique is space-time coding (STC), which involves space diversity, modulation, and error correction [7]. The STC can moderately improve the spectral efficiency and provide coding gains for error correction [8]. Alamouti [9] proposed a two-transmit-antenna scheme, which was then extended to space-time block codes (STBC) [7] and has been adopted in

Manuscript received January 19, 2009; revised April 30, 2009. First published December 22, 2009; current version published April 09, 2010. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 96-2219-E-007-013 and Grant NSC 96-2220-E-007-014. This paper was recommended by Associate Editor A. Strollo.

K.-T. Shr is with the Department of Electrical Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

H.-D. Chen was with the Institute of Communications Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan. He is currently in military service in Taiwan.

Y.-H. Huang is with the Institute of Communications Engineering and Department of Electrical Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2009.2027648

wireless standards [1], [2] to enhance the performance of wireless communication systems. However, it only improves the diversity gain rather than the channel capacity of the system. The space-time trellis code (STTC) [10]-[12] was also proposed to improve both the diversity gain and coding gain for wireless communication systems. The STTC encoder generates redundant parity check codes, which are transmitted with the original information data streams, and thereby coding gain is obtained. Therefore, the STTC technique is more robust than the STBC technique in combating severe MIMO channel impairment. However, the STTC decoder employs the Viterbi algorithm, which requires much greater decoding complexity than the STBC decoder [13]; thus, the STTC is seldom adopted as the diversity coding technique in current wireless communication systems.

In the literature, many Viterbi decoder chips have been published for convolutional codes (CC) and trellis-coded modulation (TCM) techniques. A 32-state radix-4 Viterbi decoder was first proposed and implemented in [14], and many succeeding works have proposed methods of improving the performance of the CC Viterbi decoder. Some approaches enhance the throughput performance [15] or reduce the power consumption using specific techniques, such as look-ahead calculation [16] or improved register-exchange architectures [17]. Several CC Viterbi decoders aim to support a larger number of states [18], [19] or different decoding algorithms [20] in certain communication standards. In [21], an adaptive Viterbi decoder can be reconfigured dynamically in response to the varying channel conditions so as to reduce power consumption. On the other hand, Viterbi decoders for the TCM technique, which involves error-correcting coding and modulation, have been designed and implemented in recent years. The TCM Viterbi decoder requires a greater decoding complexity than the CC Viterbi decoder; thus, more effort must be made to reduce hardware costs [22], [23]. An ASIC chip [24] has been proposed to support different communication specifications under a variety of channel conditions. Moreover, TCM Viterbi decoders for WiFi and DVB have been proposed in [25] and [26], respectively.

Compared to the traditional Viterbi decoder for convolutional codes and TCM codes, the STTC Viterbi decoder has to perform a great amount of complex-value multiplications for branch-metric computation, which is proportional to the modulation order and the number of antennas. Efficient STTC decoders with several antennas for the MIMO/MISO systems were proposed in [27], [28]; however, the STTC decoder architecture is seldom addressed or implemented.

To overcome this implementation bottleneck, an efficient

tree-search algorithm and a constant multiplier architecture

1549-8328/$26.00 2010 IEEE


Fig. 1. Space-time trellis-coded MIMO system.

[29], [30] are proposed in order to reduce the complexity of branch metric calculation in the STTC Viterbi decoder. The complexity can be reduced to such a degree that the cost of implementation is reasonable compared with the available STBC decoder [13]. Then, the STTC Viterbi decoder is implemented based on the proposed algorithm using 0.18 μm 1P6M CMOS technology. To the authors' knowledge, this is the first STTC decoder chip to be proposed and implemented.

This paper is organized as follows. The proposed STTC decoding algorithm is introduced in Section II. The STTC decoder architecture based on the proposed algorithm is then presented in Section III. The chip implementation and measurement results are demonstrated and discussed in Section IV. Finally, a conclusion to the work is given in Section V.

II. PROPOSED STTC DECODING ALGORITHM

A. Introduction of Space-Time Trellis Codes

The space-time trellis-coded MIMO system, as depicted in Fig. 1, involves modulation, error correction, and diversity techniques. At time $t$, $m$ bits of data $\mathbf{c}_t = (c_t^1, c_t^2, \ldots, c_t^m)$ are encoded in the transmitter. Then, the STTC-encoded symbols $\mathbf{x}_t = (x_t^1, x_t^2, \ldots, x_t^{n_T})^T$ are transmitted by $n_T$ transmitting antennas. In the receiver, $n_R$ receiving antennas acquire the signal $\mathbf{r}_t = (r_t^1, r_t^2, \ldots, r_t^{n_R})^T$, which is then decoded by the STTC decoder.

1) STTC Encoder: For $M$-PSK or $M$-QAM modulation, the STTC encoder receives $m = \log_2 M$ input bit sequences, as shown in Fig. 2. Each bit sequence is buffered with $\nu_k$ delay elements and encoded with a set of generator coefficients $g_{j,i}^k$, similar to the convolutional code, for the $i$-th transmitting antenna, where $j = 0, 1, \ldots, \nu_k$ for the $k$-th bit sequence and $i = 1, 2, \ldots, n_T$. Then, the encoded symbols can be expressed by

$$x_t^i = \sum_{k=1}^{m} \sum_{j=0}^{\nu_k} g_{j,i}^k \, c_{t-j}^k \bmod M$$

The generator coefficients are properly designed using the trace criterion [7] according to the channel properties and are listed in the Appendix.

2) MIMO Channel Model: In this paper, we assume that the encoded symbols are impaired by the Rayleigh fading channel, and the received signal at time $t$ is modeled by

$$\mathbf{r}_t = \mathbf{H}_t \mathbf{x}_t + \mathbf{n}_t$$

where $\mathbf{H}_t$ is the $n_R \times n_T$ channel gain matrix and $\mathbf{n}_t$ is the noise vector consisting of all white Gaussian variables. Each element in $\mathbf{H}_t$ is an independent and identically distributed complex Gaussian random variable with zero mean and unity variance.

Fig. 2. The STTC encoder.

3) STTC Decoder: The STTC decoder uses the Viterbi algorithm to detect the coded multiple streams that are transmitted through the MIMO channel. Note that the STTC trellis pertains to the modulation order in the state diagram, and there are $M$ transition paths from the previous states for $M$-PSK or $M$-QAM. Therefore, the modulation order has a great impact on the computational complexity of the branch metric calculation and the add-compare-select (ACS) computation.

Branch Metric Calculation: In the STTC Viterbi decoder, the branch metric is defined as the Euclidean distance between the actually received symbol and the faded candidate symbol corresponding to each trellis transition as follows,

$$\sum_{j=1}^{n_R} \left| r_t^j - \sum_{i=1}^{n_T} h_{j,i}^t \, x_t^i \right|^2 \tag{1}$$

where $r_t^j$ is the received symbol in the $j$-th receiving antenna at time $t$; $h_{j,i}^t$ is the channel response between the $j$-th receiving antenna and the $i$-th transmitting antenna at time $t$; and $x_t^i$ is the candidate symbol from the $i$-th transmitting antenna at time $t$. The branch metric for each trellis path is accumulated and compared to obtain the optimal candidate symbol. It can be seen that the computational complexity of the branch metric calculation increases along with the number of antennas and the modulation order.

Add-Compare-Select: The add-compare-select block compares the accumulated path metrics and determines the optimal path, which is recorded and updated in the memory. The STTC Viterbi decoder requires $M$-input ACS operations for $M$-PSK or $M$-QAM modulation. A higher modulation order greatly increases the ACS hardware costs and computation delay time and thus limits the throughput of the Viterbi decoder.

In summary, the number of transmitting and receiving antennas and the modulation order determine the computational complexity of the STTC decoder. In the following, branch metric calculation methods and a new ACS architecture for reducing the computational complexity are proposed.
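As a rough illustration of the two decoder steps summarized above, the following Python sketch computes a squared-Euclidean branch metric for one trellis transition and performs one add-compare-select step for one state. The function names, the list-of-lists channel layout `H[j][i]` (receive antenna `j`, transmit antenna `i`), and the use of floating-point arithmetic are assumptions made for illustration; this is not the fixed-point hardware datapath of the decoder.

```python
from typing import List, Sequence, Tuple


def branch_metric(r: Sequence[complex],
                  H: Sequence[Sequence[complex]],
                  x: Sequence[complex]) -> float:
    """Squared Euclidean distance between the received vector r and the
    faded candidate H @ x, summed over all receiving antennas."""
    total = 0.0
    for j, r_j in enumerate(r):
        # Faded candidate seen by receive antenna j.
        faded = sum(H[j][i] * x_i for i, x_i in enumerate(x))
        total += abs(r_j - faded) ** 2
    return total


def add_compare_select(path_metrics: Sequence[float],
                       branch_metrics: Sequence[float]) -> Tuple[float, int]:
    """One ACS step for a single state: add each incoming branch metric to
    the path metric of its source state, then keep the smallest sum."""
    sums: List[float] = [pm + bm for pm, bm in zip(path_metrics, branch_metrics)]
    best = min(range(len(sums)), key=sums.__getitem__)
    return sums[best], best
```

For an M-ary modulation, `add_compare_select` receives M incoming branches per state, which is why the comparison cost grows with the modulation order.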

B. Proposed Branch Metric Calculation Methods

Two methods for branch metric calculation are proposed. Method I performs the branch metric calculation when the channel matrix is updated for each signal vector in the fast fading channel, and Method II computes the branch metrics if the channel remains fixed during a frame period in the slow fading channel.

1) Method I: Method I aims to simplify the computation of the branch metrics because a great amount of the complexity lies in the complex-value multiplication of a channel response $h$ by a candidate symbol $x$. Writing $h = h_I + j h_Q$ and $x = x_I + j x_Q$, Method I separates the complex-value multiplication into two real-value equations, $\mathrm{Re}\{hx\} = h_I x_I - h_Q x_Q$ and $\mathrm{Im}\{hx\} = h_I x_Q + h_Q x_I$. Because the values of $x_I$ and $x_Q$ are fixed for a specific candidate symbol, we can easily calculate the product with simple shift-and-add operations. For example, in the 4-PSK STTC code in the Appendix, each candidate symbol takes one of the four 4-PSK constellation values $\{1, j, -1, -j\}$ for two transmitters. Its trellis diagram and the I/Q mapping of the faded symbols are depicted in Fig. 3(a) and (b). Then, the four faded candidate symbols can be simplified as listed in Table I. Thus, the complex-value computation is reduced to real-value additions of the real and imaginary parts of the channel responses. Similarly, this method can be applied to 8-PSK and 16-QAM STTC codes, and then only constant multiplication is required in order to compute the faded symbols.
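The simplification in Method I can be sketched for the 4-PSK case as follows. Multiplying a channel response by a 4-PSK symbol only swaps and negates its real and imaginary parts, so no multiplier is needed. The function names and the symbol ordering `[1, j, -1, -j]` are assumptions for illustration, and the two-transmitter combination is a minimal sketch rather than the contents of Table I.

```python
from typing import List


def faded_4psk(h: complex) -> List[complex]:
    """Return h*x for x in [1, j, -1, -j] using only swaps and sign changes
    of the real part hi and imaginary part hq (no multiplications)."""
    hi, hq = h.real, h.imag
    return [
        complex(hi, hq),    # h * 1
        complex(-hq, hi),   # h * j
        complex(-hi, -hq),  # h * (-1)
        complex(hq, -hi),   # h * (-j)
    ]


def faded_candidates_2tx(h1: complex, h2: complex) -> List[List[complex]]:
    """Faded candidate h1*x1 + h2*x2 for every 4-PSK symbol pair (x1, x2);
    entry [a][b] uses symbol index a on antenna 1 and b on antenna 2."""
    f1, f2 = faded_4psk(h1), faded_4psk(h2)
    return [[f1[a] + f2[b] for b in range(4)] for a in range(4)]
```

For 8-PSK and 16-QAM the symbol coordinates are no longer restricted to 0 and ±1, which is where the constant shift-and-add multiplications mentioned in the text would come in; that part is not modeled here.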

2) Method II: In the slow Rayleigh fading channel, the channel response is constant for a specific time interval; therefore, the value of each faded candidate symbol remains fixed in the time interval, which is defined as a frame in the generator coefficient determination [7].

At the beginning of each frame, the received symbol and the channel-impaired candidate symbols are mapped onto the I/Q plane. Then, the candidate node nearest to the received symbol is determined using the binary-tree structure, as shown in Fig. 3(c). The perpendicular bisector used to detect the winner of each binary comparison between each pair of nodes can be expressed as follows:

(2)

Fig. 3. (a) Trellis diagram. (b) I/Q mapping of the candidate symbols for the4-PSK modulation. (c) Binary-tree comparison.

TABLE I. COORDINATES OF NODES FOR 4-PSK

After the winner is determined, Method II replaces the value of the received symbol in (1) with the value of the nearest candidate; that is, the distance between the nearest candidate and each other candidate is taken as the actual branch metric between the received symbol and that candidate. Once the distances between any two nodes and the bisector equations are precomputed in the first symbol of each frame, only the distance between the nearest candidate and the received symbol needs to be computed for the remaining received symbols in the frame, and thus the computational complexity can be reduced.

Method II is used to compute the branch metrics for an example of the 4-state 4-PSK STTC with 2Tx and 1Rx, whose trellis transition is shown in Fig. 3(a). In the original branch metric calculation, the four branch metrics are calculated using the following equations:

(3)





In Method II, the received symbol and the candidate symbols are mapped onto the I/Q plane, as shown in Fig. 3(b). The faded candidate symbols are computed in the same way as in Method I. To determine the faded symbol node nearest to the received symbol, three perpendicular bisectors are derived as follows:

(4)

Then, the faded node nearest to the received symbol can be obtained using the tree-search comparison, and the branch metrics of the three loser nodes are replaced with the precomputed distances between those nodes and the nearest node, as follows.

(5)

The coordinates of the faded candidate symbols, the distances between any two nodes, and all perpendicular bisectors are precomputed at the beginning of a frame and saved in memory until the end of the frame. Therefore, these precomputed distances can be used without the need for additional branch metric calculations for all other loser nodes in the tree-search comparison. Therefore, if the frame is large, a great amount of complexity can be reduced.

The idea of precomputing slow-changing or common information is popular in digital signal processing algorithms for communication. Typical examples are channel estimation for OFDM systems and preprocessing for the MIMO detector, such as QR decomposition and lattice reduction. These operations are usually performed block-wise, i.e., all computations in these functions are performed once and afterward their results are utilized for a frame. In the proposed Method II, the branch metric calculation block itself is not slow-changing or common during a frame because it must compute branch metrics for successive input signals. Instead, the nearest node (fixed for a frame) replaces the received signal (variant within a frame) so that the branch metric computation can be approximated by using the fixed node distances, which are precomputed in the initial symbols of a frame. In the implementation of the algorithm and the hardware architecture, the branch metric calculation block partially generates the results with the precomputed constant node distances and partially computes the branch metric of the winner node for the successive input signals during a frame. This is the main difference from other similar design concepts in digital signal processing for communication.
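The Method II procedure described in this subsection can be sketched in Python as follows. The sketch precomputes all pairwise bisectors and node distances once per frame, finds the nearest faded node with a pairwise knock-out (binary-tree) search, and then reuses the precomputed distances as approximate branch metrics for the loser nodes. The half-plane form of the bisector test, `Re(r * conj(b - a)) < (|b|^2 - |a|^2) / 2`, is an assumption chosen for compactness and is not the slope/intercept form used in the hardware; all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Bisector = Tuple[complex, float]


def precompute(nodes: Sequence[complex]) -> Tuple[Dict[Tuple[int, int], Bisector],
                                                  List[List[float]]]:
    """Per-frame precomputation: one bisector per node pair (a < b) and the
    squared distance between every pair of faded candidate nodes."""
    n = len(nodes)
    bis = {}
    for a in range(n):
        for b in range(a + 1, n):
            d = nodes[b] - nodes[a]
            c = (abs(nodes[b]) ** 2 - abs(nodes[a]) ** 2) / 2
            bis[(a, b)] = (d, c)
    dist = [[abs(nodes[a] - nodes[b]) ** 2 for b in range(n)] for a in range(n)]
    return bis, dist


def tree_search(r: complex, n: int, bis: Dict[Tuple[int, int], Bisector]) -> int:
    """Binary-tree comparison: knock out one node of each pair per level.
    r is nearer to node a than to node b iff Re(r * conj(d)) < c."""
    alive = list(range(n))
    while len(alive) > 1:
        nxt = []
        for k in range(0, len(alive) - 1, 2):
            a, b = alive[k], alive[k + 1]  # a < b always holds here
            d, c = bis[(a, b)]
            nxt.append(a if (r * d.conjugate()).real < c else b)
        if len(alive) % 2:
            nxt.append(alive[-1])
        alive = nxt
    return alive[0]


def method2_metrics(r: complex, nodes: Sequence[complex],
                    bis: Dict[Tuple[int, int], Bisector],
                    dist: List[List[float]]) -> Tuple[int, List[float]]:
    """Approximate branch metrics: exact distance for the winner node,
    precomputed node-to-winner distances for all loser nodes."""
    w = tree_search(r, len(nodes), bis)
    metrics = list(dist[w])
    metrics[w] = abs(r - nodes[w]) ** 2
    return w, metrics
```

Per received symbol, only the comparisons of the tree search and one exact distance are evaluated; the remaining metrics are table look-ups, which is the source of the savings when the frame is long.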

C. Performance Analysis

The STTC Viterbi decoder using the proposed branch metric calculation methods was simulated and analyzed. The frame size is assumed to be 130 received symbols for slow fading STTC codes. The frame error rate (FER) performances for different modulation schemes are shown in Fig. 4. It can be seen that Method I does not degrade performance because only the simplification of the complex-value multiplication is applied. Moreover, if the Rayleigh fading channels are slow, Method II can further reduce the computational complexity at the cost of degradation in coding gain.

D. Complexity Analysis

The computational complexities of the proposed branch metric calculation methods are analyzed for the different MIMO configurations. It is assumed that the channel response remains fixed in a frame, which contains 130 symbols in the slow fading channel and only one symbol in the fast fading channel. Fig. 5 depicts the computational complexities for 16-QAM under different MIMO configurations. Note that Method I is equivalent to the traditional Viterbi algorithm; thus, the total number of operations is not reduced. However, the constant portion represents the simplified multiplications and occupies almost 50% of the total number of computations. Because the constant multiplications can be implemented using shift-and-add operations, cost and power can be greatly reduced in the VLSI implementation. If the fading channel is slow, both methods can reduce the majority of constant operations because the faded candidate symbols are computed only once in a single frame. Furthermore, more operations can be reduced in Method II because the computations of the branch metrics for the loser nodes are replaced by accessing the precomputed distances in memory. It can be seen that the computational complexities can be reduced by 70%, 78%, and 83% for 2 × 2, 3 × 3, and 4 × 4 MIMO configurations. Note that the computational complexities can be further reduced by more than 90% in the 8 × 8 MIMO using Method II in the slow fading channel.

The computational complexities of the proposed methods have a close relationship with the modulation order and the frame size, as shown in Fig. 6. For Method I, the metric values are computed for each received symbol. Thus, the computational complexity tends to converge as the frame size increases. For Method II, the precomputation of the distances between any two nodes and the bisector equations requires a large amount of complexity in the initially received symbols in a frame. However, the computational complexities are averaged and become lower as the frame size increases because the later distance results are accessed from the memory without requiring any computation. Therefore, more computation cost can be saved. These analysis results provide the information for determining which method is employed under different circumstances. The results reveal that the frame-size threshold increases along with the increase in the modulation order, while the threshold remains fixed for a specific modulation order under different MIMO configurations. Therefore, the proper branch metric calculation method (Method I or II) can be employed for different frame sizes and modulation orders.

Fig. 4. Frame error rate for (a) 4-PSK, (b) 8-PSK, and (c) 16-QAM modulations.

Fig. 5. Computational complexity of 16-QAM modulations.

Fig. 6. Complexity versus frame size for a 4 × 4 MIMO.

Fig. 7. Block diagram for the STTC decoder.

Fig. 8. Branch metric generator architecture.

III. ARCHITECTURE AND CIRCUIT DESIGN

The STTC Viterbi decoder, like the convolutional code Viterbi decoder, consists of three basic blocks: the branch metric generator, the add-compare selector, and the path&metric updater for final decoding, as shown in Fig. 7. The decoder supports 4-PSK/8-PSK/16-QAM modulations with 4/8/16, 8/16, and 16 states, respectively, and provides two operation modes, Mode I and Mode II, for the respective methods, Method I and Method II, in the proposed branch metric calculation.

Fig. 9. Faded symbol generator.

A. Branch Metric Generator

The branch metric generator (BMG), as shown in Fig. 8, is composed of the faded symbol generators (FSGs), precomputation and decision blocks (PDBs), and branch metric calculators (BMCs). In Mode I, the received symbol and the faded symbols are used to calculate the branch metrics. In Mode II, the PDBs compute the bisector equations between any two nodes of the faded candidate symbols and detect the winner node for the branch metric calculation in Method II. Then, the BMCs use the faded symbols to compute the branch metrics.

1) Faded Symbol Generators: Four faded symbol generators are used in the BMG, and each FSG computes the faded candidate symbols using the shift-and-add operations, as shown in Fig. 9. Then, the faded symbols are stored in memory. Because, at most, sixteen states are used in the STTC specification, sixteen sets of real and imaginary parts are required for one FSG. In Mode I, these faded symbols are directly delivered to the branch metric calculator, while, in Mode II, they are used to precompute the distances and the bisectors. Six bits are used for the received signal and channel response, thus the total memory size in the four FSGs is 4 × 16 . Because four candidate symbols are computed or read by the four FSGs, the faded symbol computation can be carried out in one cycle for the 4-PSK modulation, two cycles for the 8-PSK modulation, and four cycles for the 16-QAM modulation.

2) Pre-Computation and Decision Block: In the initial symbols of a frame, the PDBs compute the necessary bisector equations for binary-tree comparison and then store these equations in memory. If the required bisector equations have been stored in memory, the results are just read without requiring any computation. Therefore, the PDBs do not compute any bisector equation once all of the bisectors have been stored. The slope, intercept, and scaled constant can be computed for each perpendicular bisector as follows:

(6)





Fig. 10. Pre-computation and decision block.

and stored in the PDB memory if the bisector has never been calculated. Two multipliers are shared among the computations of these quantities, and thus the PDB requires one additional cycle at the beginning of the frame, as shown in Fig. 10. However, a quantity that has already been computed and stored in memory can be read directly, and therefore only the remaining quantities have to be computed. The PDB needs only one cycle once all the bisectors are computed and stored. Six PDBs are utilized in the BMG, and each PDB contains a memory block dimensioned for, at most, sixteen states. Therefore, six bisectors can be generated for binary-tree comparison in one cycle.

3) Branch Metric Calculator: In Mode I, four BMCs directly compute the Euclidean distances, as shown in Fig. 11. In Mode II, the BMC computes the branch metric and stores it in memory only when the required branch metric has not been previously computed. Once all the branch metrics have been computed and stored in memory, the branch metrics are simply read from memory, and more computation power can be saved accordingly.

4) Computation Timing Schedule: The BMG is composed of four FSGs, six PDBs, and four BMCs. Fig. 12 shows the Mode II BMG timing schedules for the 4-PSK and 8-PSK modulations. For the 4-PSK modulation, the BMG generates the metrics for one received symbol in four cycles at the beginning of a frame, and in three cycles for the following symbols. Note that since six PDBs compute all the bisector equations in the first symbol, the remaining symbols in the frame require three cycles for a single symbol. On the other hand, the PDBs only compute the bisectors necessary for binary-tree comparison in one received symbol and store them in memory. Thus, four cycles are required at the beginning of a frame for a single symbol. Once all the bisectors are saved in memory, three cycles are required for the remaining symbols. For the 8-PSK modulation, as depicted in Fig. 12(b), the initial received symbol requires nine cycles to complete the branch metric calculation, while the remaining symbols require seven cycles. In order to process the symbols for the 16-QAM modulation, seventeen cycles are required at the beginning, while thirteen cycles are required for the remaining symbols.
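The effect of this schedule on a whole frame can be illustrated with a short calculation. The sketch below uses the cycle counts quoted above and assumes, as a simplification, that only the first received symbol of a frame pays the longer precomputation cost; the dictionary and function names are hypothetical.

```python
# Mode II BMG cycle counts per received symbol, as quoted in the text.
FIRST_SYMBOL_CYCLES = {"4-PSK": 4, "8-PSK": 9, "16-QAM": 17}
STEADY_STATE_CYCLES = {"4-PSK": 3, "8-PSK": 7, "16-QAM": 13}


def average_bmg_cycles(modulation: str, frame_size: int = 130) -> float:
    """Average BMG cycles per symbol over one frame, assuming (simplification)
    that only the first symbol of the frame needs the longer schedule."""
    first = FIRST_SYMBOL_CYCLES[modulation]
    steady = STEADY_STATE_CYCLES[modulation]
    return (first + (frame_size - 1) * steady) / frame_size


if __name__ == "__main__":
    for name in FIRST_SYMBOL_CYCLES:
        print(name, round(average_bmg_cycles(name), 3))
```

With the 130-symbol frame used elsewhere in the paper, the average stays close to the steady-state count, so the extra cycles at the start of a frame are amortized almost completely.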

B. Add-Compare Selector

In the ACS block, branch metrics are accumulated in the path metrics for determining the decoding path in the trellis diagram, as shown in Fig. 13. Because the 4-input binary comparator requires a longer computation time and a higher cost, a simple 4-input minimum comparator architecture is proposed, as shown in Fig. 14. The n-bit comparison unit in this comparator encodes the branch metrics using an array of bitwise ORs, as shown in Fig. 15. The ones counter then counts the number of ones in the sequence. Suppose that this count is less than eight. Then the bit of each input sequence at the position given by the count is chosen to form the output sequence. On the other hand, if the count is equal to eight, the sixth bit of each metric is chosen as the output sequence. The 4-bit sequence is compared with Table II to determine the minimum metric. To demonstrate the advantages of the 4-input minimum comparator, the proposed and tree-based comparators were designed and simulated using 0.18 μm CMOS technology. The proposed comparator occupies 1806.23 μm² with a critical path delay of 4.75 ns, while the tree-based comparator occupies 2837.42 μm² with a critical path delay of 8.79 ns. The proposed minimum comparator may cause errors in some cases; however, these errors have a negligible effect on the decoder performance, as shown in the following fixed-point simulation results. Moreover, a simple but reliable normalization architecture is proposed in order to prevent data overflow, as shown in Fig. 16. When all the path metrics are larger than 11'b00001000000, this constant is subtracted from each path metric. Fig. 17 shows the ACS operation schedules for the 4-PSK and 8-PSK modulations. Comparison for the 4-PSK modulation can be performed in one cycle using the 4-input minimum comparator. However, an additional 2-input comparison is required for the 8-PSK modulation in an additional cycle, as shown in Fig. 17. Thus, three clock cycles are required in order to determine the minimum branch metric for each state in the 8-PSK modulation, and five cycles are required to perform five 4-input comparisons for the 16-QAM modulation.

Fig. 11. The branch metric calculator.

Fig. 12. BMG timing schedules for (a) 4-PSK and (b) 8-PSK.

Fig. 13. ACS block.
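The normalization rule and the ACS cycle counts stated above can be sketched as follows. The sketch assumes that the 11-bit constant (decimal 64) is subtracted from the path metrics, and that a single 4-input comparator is reused sequentially, one comparison per cycle; both function names are hypothetical, and the approximate bitwise comparator itself is not modeled.

```python
from typing import List, Sequence

# 11-bit normalization constant from the text: 00001000000 (binary) = 64.
NORM_CONSTANT = 0b00001000000


def normalize(path_metrics: Sequence[int]) -> List[int]:
    """Subtract the constant from every path metric, but only when all of
    them exceed it, so that no metric becomes negative."""
    if all(pm > NORM_CONSTANT for pm in path_metrics):
        return [pm - NORM_CONSTANT for pm in path_metrics]
    return list(path_metrics)


def acs_compare_cycles(num_inputs: int) -> int:
    """Cycles to find the minimum of num_inputs metrics with one comparator
    that handles up to four inputs per cycle (assumed sequential reuse):
    one cycle per group of four, plus one cycle to compare group winners."""
    groups = num_inputs // 4
    return groups + (1 if groups > 1 else 0)
```

Under these assumptions the counts reproduce the schedule in the text: one cycle for 4-PSK, three for 8-PSK, and five for 16-QAM.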

C. Path&Metric Updater

After the minimum metric and its path indexes are determined, they are stored in memory, and the final results are decoded using the trace-back algorithm [31], [32]. The path&metric updater is shown in Fig. 18. Eight 4-to-1 multiplexers perform the trace-back for the 4-state trellis, while four 8-to-1 multiplexers and two 16-to-1 multiplexers are used for the 8-state and 16-state trellis, respectively. In the proposed design, the fixed-state trace-back algorithm, in which state 0 is regarded as the starting point for tracing back in the decoding window, is chosen. Thus, the hardware costs can be reduced significantly, while the performance degradation between the fixed-state and soft-state trace-back algorithm is negligible.





Fig. 14. 4-input comparator.

Fig. 15. n-bit comparison unit.

TABLE II. LOOK-UP THE MINIMUM DECISION BLOCK

D. Fixed-Point Simulation Results

In this work, efforts were dedicated to reducing the complexities of branch metric calculation, which implies that the performance is sensitive to the precision of branch metrics. Therefore, fixed-point simulations are performed before the chip implementation in order to ensure that the degradation is only slight.

Fig. 16. Normalization block.

Fig. 17. ACS block timing schedules for 4-PSK and 8-PSK.

Fig. 19 shows that the performance of the fixed-point Viterbi

decoder using a fixed-state trace-back algorithm approximates

that of the floating-point Viterbi decoder using a soft-state

trace-back algorithm. The word-length of each signal was

determined through fixed-point simulations according to the

frame error rate (FER) metric. The decoding window size was

also determined through fixed-point simulations, as shown in

Fig. 20. The results show that the minimum truncation window

size for 4-PSK is about the constraint length, and

the window size for 8-PSK and 16-QAM is the

constraint length. The decreasing window size with the increase in modulation order is mainly due to the fact that the higher diversity in the STTC trellis causes a faster convergence in the decoding path. Therefore, 4 × 16 × 32 bits of memory are used to store 4 branch paths, 16 states, and 33 truncation window sizes.

Fig. 18. Path&metric updater for the m-bit modulation.

Fig. 19. Fixed-point simulation results for the proposed STTC decoder.

Fig. 20. Simulation of different truncation window sizes for the proposed STTC decoder.

Fig. 21. Photo of the STTC decoder chip.

TABLE III. CHIP SPECIFICATION FOR THE STTC DECODER

TABLE IV. POWER CONSUMPTIONS FOR DIFFERENT MODULATIONS AND STATE SIZES UNDER MODE I AND MODE II

IV. CHIP IMPLEMENTATION

The proposed STTC decoder was fabricated using TSMC 0.18 μm one-poly six-metal CMOS technology (see Fig. 21). The chip occupies about 4.35 mm² with a core area of 1.62 mm² and includes a range of memory capacities as listed in Table III. The STTC decoder chip supports 4-PSK, 8-PSK, and 16-QAM modulations and 2 × 1, 3 × 1, and 4 × 1 antenna configurations. 128 frames of test patterns are generated for each modulation type, and each frame comprises 130 data symbols impaired by the fading channel and AWGN noise defined in Section II-A2. These patterns are then used to test the STTC decoder chip in Modes I and II via the Agilent 93000 digital test station. The detailed measurement results are listed in Table III.

TABLE V. COMPARISON TABLE

The clock rate of the chip achieves 22.28 MHz, and the throughput ranges from 2.785 Mbps to 11.14 Mbps for the 4-PSK modulation, from 4.18 Mbps to 8.36 Mbps for the 8-PSK modulation, and is 5.57 Mbps for the 16-QAM modulation. The power consumption ranges from 0.43 mW to 2.45 mW for different numbers of states and modulation orders, as listed in Table IV. It can be seen that the power consumption increases along with the modulation order and the number of states. When the modulation order is higher than 8-PSK, or the number of states is larger than sixteen, the chip reaches the upper power consumption limits of about 2.4 mW for Mode I and about 2.0 mW for Mode II because the branch metric generator and path&metric updater were designed for the 8-PSK and 16-state scheme. Although the STTC Viterbi chip was only designed for the 8-PSK and 16-state scheme because of the rapid growth in the numbers of the FSG, PDB, and BMC blocks for 16-QAM modulation, the chip can still perform the 16-QAM STTC decoding at the expense of an increase in the processing latency.
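The reported throughput figures are consistent with a simple relation between clock rate, bits per symbol, and clock cycles spent per decoded symbol. The sketch below illustrates that relation. The cycles-per-symbol values used in the example are not stated in the text; they are back-solved here from the reported clock rate and throughputs and should be read as an assumption.

```python
def throughput_mbps(clock_mhz: float, bits_per_symbol: int,
                    cycles_per_symbol: int) -> float:
    """Decoded bit rate in Mbps when one symbol carrying bits_per_symbol
    bits is decoded every cycles_per_symbol clock cycles."""
    return clock_mhz * bits_per_symbol / cycles_per_symbol


if __name__ == "__main__":
    CLOCK_MHZ = 22.28  # measured clock rate quoted in the text
    # (modulation, bits per symbol, ASSUMED cycles per symbol)
    cases = [
        ("4-PSK, fastest", 2, 4),
        ("4-PSK, slowest", 2, 16),
        ("8-PSK, fastest", 3, 8),
        ("8-PSK, slowest", 3, 16),
        ("16-QAM", 4, 16),
    ]
    for name, bits, cycles in cases:
        print(f"{name}: {throughput_mbps(CLOCK_MHZ, bits, cycles):.3f} Mbps")
```

Under these assumed cycle counts, the relation gives values that match the quoted 11.14, 2.785, and 5.57 Mbps figures and agree with the quoted 8-PSK range after rounding.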

Note that Mode II (for Method II) reduces the power consumption relative to Mode I (for Method I), and the cost overhead for Mode II lies in the PDBs, which account for only 9.38% of the total cell area. The great power reduction is achieved by the fact that, in Mode II for the slow fading channel, the branch metrics are computed using the PDBs and are stored in memory. After most of the branch metrics are prestored in memory, the high branch metric computation power is replaced by the lower memory access power.

Although no STTC Viterbi decoder has previously been reported in the literature, the chip is still compared with other relatively state-of-the-art designs, such as the CC Viterbi decoders [15], [18], [19] and TCM Viterbi decoders [22]-[24], as listed in Table V. Because Viterbi decoder performance has a close relationship with the computational complexity of the trellis diagram of the Viterbi decoder, a trellis work-load factor (TWLF) is defined to indicate the computational complexity per trellis stage as follows:

where is the arithmetic computational cost for branch metric

computation; is the number of modulated bits; is the state

number; and is the antenna dimensions. and

both equal 1 for the CC and TCM decoders.

The Viterbi decoders in [22] and [23] both use the Hamming distance to calculate the branch metrics; thus, their branch metric computation cost is only one XOR gate-count. The cost for the CC Viterbi decoders in [15], [18], and [19] represents the gate-count of one subtractor and one squarer, which are used to compute the Euclidean distances with different word-lengths in a single dimension. The TCM Viterbi decoder in [24] employs the simplified Euclidean distance to determine the branch metrics in two dimensions, and thus its cost is the gate-count of two subtractors without any squarer. The cost for the proposed STTC Viterbi decoder corresponds to the complex-valued channel-fading multiplications and the Euclidean distance computation [see (1)]. Gate-counts for the XOR, subtractor, squarer, and complex-valued multiplier are estimated by synthesizing these operators using a standard cell library. The cost values in Table V are normalized based on the design in [23], which is one XOR gate.
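To make the difference in arithmetic cost concrete, the four branch metric styles compared above can be written out as a minimal Python sketch. Two points are assumptions: the "simplified Euclidean distance" of [24] is modeled here as a sum of absolute differences (two subtractions, no squarer), and the STTC metric uses the textbook form, the squared distance between each received sample and the channel-weighted sum of transmitted symbols, which is assumed to correspond to (1). Function and argument names are illustrative.

```python
def hamming_metric(received_bits, expected_bits):
    """Hamming distance: XOR the two words and count the ones."""
    return bin(received_bits ^ expected_bits).count("1")


def squared_euclidean_1d(r, s):
    """One subtraction followed by one squaring (single dimension)."""
    d = r - s
    return d * d


def simplified_euclidean_2d(r, s):
    """Two subtractions and no squarer.

    ASSUMPTION: modeled as the sum of absolute differences of the
    in-phase and quadrature components.
    """
    return abs(r.real - s.real) + abs(r.imag - s.imag)


def sttc_metric(r, h, x):
    """Textbook STTC branch metric.

    r[j]    : received sample at receive antenna j
    h[i][j] : channel gain from transmit antenna i to receive antenna j
    x[i]    : candidate symbol on transmit antenna i
    """
    total = 0.0
    for j, rj in enumerate(r):
        expected = sum(h[i][j] * x[i] for i in range(len(x)))
        total += abs(rj - expected) ** 2
    return total


if __name__ == "__main__":
    print(hamming_metric(0b1011, 0b0010))
    print(squared_euclidean_1d(1.5, -1.0))
    print(simplified_euclidean_2d(1 + 1j, 0 - 1j))
    print(sttc_metric([1 + 0j], [[1 + 0j], [1j]], [1 + 0j, 1 + 0j]))
```

Only the last function needs complex multiplications, and their count grows with the number of transmit and receive antennas, which is the source of the larger cost attributed to the STTC decoder.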

Note that, although the proposed chip has lower throughput than the other decoders, the TWLF for the STTC Viterbi decoder is much larger than those of the other Viterbi decoders because the STTC Viterbi decoder must perform much more complex branch-metric computation, which increases along with the antenna dimension and modulation order. To identify the processing capabilities of Viterbi decoders for different types of trellis codes, we normalize the chip throughput by multiplying it by the TWLF. The normalized throughput, shown in brackets in Table V, can be regarded as the trellis processing

capability per second for the Viterbi decoder. If we further

consider the power consumption, we can see that the power

efficiency (throughput per mW) of the proposed STTC Viterbi

decoder outperforms most other Viterbi decoders except for [24]. Note that these comparison values still depend on the fabrication technology. If the proposed decoder is implemented using an advanced technology, its (normalized) throughput and power efficiency can be further improved.

TABLE VI. GENERATOR COEFFICIENT SETS OF THE STTC ENCODER FOR THE SLOW FADING CHANNEL
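The two comparison quantities used with Table V, normalized throughput (throughput multiplied by the TWLF) and power efficiency (throughput per mW), reduce to one-line formulas. The sketch below shows them in Python; the function names are illustrative, and the numbers in the usage example are placeholders rather than values from Table V.

```python
def normalized_throughput(throughput_mbps, twlf):
    """Throughput scaled by the trellis work-load factor (TWLF)."""
    return throughput_mbps * twlf


def power_efficiency(throughput_mbps, power_mw):
    """Throughput per milliwatt, in Mbps/mW."""
    if power_mw <= 0:
        raise ValueError("power must be positive")
    return throughput_mbps / power_mw


if __name__ == "__main__":
    # Placeholder inputs for illustration only (not taken from Table V).
    throughput_mbps = 10.0
    twlf = 100.0
    power_mw = 2.0

    norm = normalized_throughput(throughput_mbps, twlf)
    print("normalized throughput:", norm)
    print("power efficiency:", power_efficiency(throughput_mbps, power_mw))
    print("normalized power efficiency:", power_efficiency(norm, power_mw))
```

Passing the normalized throughput to `power_efficiency` gives the efficiency in terms of trellis work rather than raw bits, which is one way to compare decoders whose branch metric costs differ widely.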

V. CONCLUSION

In this paper, a low-complexity STTC decoding algorithm and its associated hardware architecture are proposed. Method I (Mode I) is designed for fast fading channels and Method II (Mode II) is designed for slow fading channels. The complexity analysis provides the information needed to select the proper method (mode) under different configurations. The STTC decoder is implemented using 0.18 μm 1P6M CMOS technology and supports 4 × 1, 3 × 1, and 2 × 1 configurations for 4/8/16-state 4-PSK, 8/16-state 8-PSK, and 16-state 16-QAM STTC schemes. Moreover, two modes are offered for decoding the received symbols under different fading channel conditions. This chip yields a maximum throughput of 11.14 Mbps at a power consumption of 0.43 mW. In conclusion, an STTC decoder was realized in a silicon chip, which the authors believe can improve the reliability of future coded MIMO communication systems.

TABLE VII. GENERATOR COEFFICIENT SETS OF THE STTC ENCODER FOR THE FAST FADING CHANNEL

APPENDIX

GENERATOR COEFFICIENT SETS

See Tables VI and VII.

ACKNOWLEDGMENT

The authors would like to thank the Chip Implementation Center (CIC) of the National Science Council in Taiwan for technical support. They also thank the anonymous reviewers for their valuable suggestions, which greatly improved this paper.

REFERENCES

[1] Information Technology-Telecommunications and Information Exchange Between Systems-Local and Metropolitan Area Networks-Specific Requirements-Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment 4: Enhancements for Higher Throughput, 2008, IEEE Unapproved Draft Standard 802.11n, 4.00.

[2] Local and Metropolitan Area Networks Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems Amendment 2: Physical and Medium Access Control Layers for Combined Fixed and Mobile Operation in Licensed Bands and Corrigendum 1, 2006, IEEE Standard 802.16.

[3] J. Winters, J. Salz, and R. D. Gitlin, "The impact of antenna diversity on the capacity of wireless communication systems," IEEE Transactions on Communications, vol. 42, pp. 1740–1751, Feb. 1994.

[4] Y. Jung, J. Kim, S. Noh, H. Yoon, and J. Kim, "A digital 120 Mb/s MIMO-OFDM baseband processor for high speed wireless LANs," in Proc. IEEE CICC'05, Sep. 2005, pp. 81–84.

[5] T. Chen, Z. Yu, Y. Peng, Y. Zhang, H. Dai, and X. Liu, "A MIMO receiver SOC for CDMA applications," in Proc. IEEE International SOC Conference, Sep. 2006, pp. 275–278.

[6] Y. Jung, J. Kim, S. Lee, H. Yoon, and J. Kim, "Design and implementation of MIMO-OFDM baseband processor for high-speed wireless LANs," IEEE Transactions on Circuits and Systems–Part II: Express Briefs, vol. 54, pp. 631–635, Jul. 2007.

[7] B. Vucetic and J. Yuan, Space-Time Coding. John Wiley and Sons, 2003.

[8] D. Bevan and R. Tanner, "Performance comparison of space-time coding techniques," IEE Electronics Letters, vol. 35, pp. 1707–1708, Sep. 1999.

[9] S. M. Alamouti, "A simple transmit diversity technique for wireless communications," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 1451–1458, Oct. 1998.

[10] V. Tarokh, N. Seshadri, and A. R. Calderbank, "Space-time codes for high data rate wireless communication: Performance criterion and code construction," IEEE Transactions on Information Theory, vol. 44, pp. 744–765, Mar. 1998.

[11] A. Naguib, V. Tarokh, N. Seshadri, and A. Calderbank, "A space-time coding modem for high-data-rate wireless communications," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 1451–1458, Oct. 1998.

[12] V. Tarokh, A. Naguib, N. Seshadri, and A. R. Calderbank, "Combined array processing and space-time coding," IEEE Transactions on Information Theory, vol. 45, pp. 1121–1128, May 1999.

[13] E. Cavus and B. Daneshrad, "A very low-complexity space-time block decoder (STBD) ASIC for wireless systems," IEEE Transactions on Circuits and Systems–Part I: Regular Papers, vol. 53, pp. 60–69, Jan. 2006.

[14] P. J. Black and T. H. Meng, "A 140-Mb/s, 32-state, radix-4 Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 27, pp. 1877–1885, Dec. 1992.

[15] E. Yeo, S. A. Augsburger, W. R. Davis, and B. Nikolic, "A 500-Mb/s soft-output Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 38, pp. 1234–1241, Jul. 2003.

[16] C. Cheng and K. K. Parhi, "Hardware efficient low-latency architecture for high throughput rate Viterbi decoders," IEEE Transactions on Circuits and Systems–Part II: Express Briefs, vol. 55, pp. 1254–1258, Dec. 2008.

[17] M. D. Shieh, T. P. Wang, and D. W. Yang, "Low-power register-exchange survivor memory architectures for Viterbi decoders," IET Circuits, Devices, & Systems, vol. 3, pp. 83–90, Apr. 2009.

[18] F. Sun and T. Zhang, "Parallel high-throughput limited search trellis decoder VLSI design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 1013–1022, Sep. 2005.

[19] M. A. Anders, S. K. Mathew, S. K. Hsu, R. K. Krishnamurthy, and S. Borkar, "A 1.9 Gb/s 358 mW 16–256 state reconfigurable Viterbi accelerator in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, pp. 214–222, Jan. 2008.

[20] C. C. Lin, Y. H. Shih, H. C. Chang, and C. Y. Lee, "A low power turbo/Viterbi decoder for 3GPP2 applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, pp. 426–430, Apr. 2006.

[21] R. Tessier, S. Swaminathan, R. Ramaswamy, D. Goeckel, and W. Burleson, "A reconfigurable, power-efficient adaptive Viterbi decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 484–488, Apr. 2005.

[22] E. F. Haratsch and K. Azadet, "A low complexity joint equalizer and decoder for 1000Base-T Gigabit Ethernet," in Proc. IEEE CICC'00, May 2000, pp. 465–468.

[23] A. Dinh and X. Hu, "A hardware-efficient technique to implement a trellis code modulation decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 745–750, Jun. 2005.

[24] M. Kamuf, V. Owall, and J. B. Anderson, "Optimization and implementation of a Viterbi decoder under flexibility constraints," IEEE Transactions on Circuits and Systems–Part I: Fundamental Theory and Applications, vol. 55, no. 8, pp. 2411–2422, Sep. 2008.

[25] S. Nandula, Y. S. Rao, and S. P. Embanath, "High speed area efficient configurable Viterbi decoder for WiFi and WiMAX systems," in Proc. IEEE ICIAS'07, Nov. 2007, pp. 1396–1399.

[26] R. Manzoor, A. Rafique, and K. B. Bajwa, "Hardware implementation of pragmatic trellis coded modulation applied to 8PSK and 16QAM for DVB standard," in Proc. IEEE CNSDSP'08, Jul. 2008, pp. 363–367.

[27] H. Lee and M. P. Fitz, "Systematic expansion of full diversity space-time multiple TCM codes for two transmit antennas," IEEE Transactions on Wireless Communications, vol. 7, pp. 2027–2032, Jun. 2008.

[28] T. M. H. Ngo, G. Zaharia, S. Bougeard, and J. F. Helard, "Design of balanced QPSK space-time trellis codes for several transmit antennas," in Proc. IEEE ISSCS'07, Jul. 2007, vol. 2, pp. 1–4.

[29] D. Kim and H.-W. Choi, "Advanced constant multiplier for multi-path pipelined FFT processor," IET Electronics Letters, vol. 44, pp. 518–519, Apr. 2008.

[30] H. T. Nguyen and A. Chatterjee, "Number-splitting with shift-and-add decomposition for power and hardware optimization in linear DSP synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 419–424, Aug. 2000.

[31] O. Collins and F. Pollara, Memory Management in Traceback Viterbi Decoders, Jet Propulsion Laboratory, 1988, TDA Progress Report 42-99.

[32] T. K. Truong, M. T. Shih, I. S. Reed, and E. H. Satorius, "A VLSI design for a trace-back Viterbi decoder," IEEE Transactions on Communications, vol. 40, pp. 616–624, Mar. 1992.

Kai-Ting Shr (S'08) was born in Taiwan in 1983. He received the B.S. degree in electrical engineering from National Tsing-Hua University (NTHU), Hsinchu, Taiwan, in 2005, where he is currently working toward the Ph.D. degree.

His research interests include VLSI design and implementation of low-power and high-throughput communication systems.

Hong-Du Chen was born in Taiwan in 1982. He received the B.S. degree in electronic engineering from Chang-Gung University, Taoyuan, Taiwan, in 2004 and the M.S. degree in communications engineering from National Tsing-Hua University, Hsinchu, Taiwan, in 2007.

He is currently in military service in Taiwan. His research interests include VLSI design and implementation of communication applications.

Yuan-Hao Huang (S'98–M'02) was born in Taiwan in 1973. He received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1995 and 2001, respectively.

He was a Member of Technical Staff with VXIS Technology Corporation, Hsin-Chu, Taiwan, from 2001 to 2005. Since 2005, he has been with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing-Hua University, Taiwan, where he is currently an Assistant Professor. His research interests include VLSI design for digital signal processing systems and telecommunication systems.
