8/6/2019 A Low-Complexity Viterbi Decoder
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 57, NO. 4, APRIL 2010 873
A Low-Complexity Viterbi Decoder for Space-Time Trellis Codes
Kai-Ting Shr, Student Member, IEEE, Hong-Du Chen, and Yuan-Hao Huang, Member, IEEE
Abstract: Space-time trellis code (STTC) has been widely applied to coded multiple-input multiple-output (MIMO) systems because of its gains in coding and diversity; however, its great decoding complexity makes it less promising in chip realization compared to the space-time block code (STBC). The complexity of STTC decoding lies in the branch metric calculation in the Viterbi algorithm and increases significantly along with the number of antennas and the modulation order. Consequently, a low-complexity algorithm to mitigate the computational burden is proposed. The results show that more than 70%, 78%, and 83% of the computational complexity is reduced for 2×2, 3×3, and 4×4 MIMO configurations, respectively. Based on the proposed algorithm, a reconfigurable MISO STTC Viterbi decoder is designed and implemented using 0.18 μm 1P6M CMOS technology. The decoder achieves 11.14 Mbps, 8.36 Mbps, and 5.75 Mbps for 4-PSK, 8-PSK, and 16-QAM modulations, respectively.
Index Terms: Branch metrics, MIMO, space-time trellis code, Viterbi decoder.
I. INTRODUCTION
IN RECENT years, multiple-input multiple-output (MIMO) transmission technology has been widely applied to various wireless communication systems, such as IEEE 802.11n WiFi [1] and IEEE 802.16e WiMAX [2]. MIMO technology is divided into two categories, spatial multiplexing and diversity coding [3]. In the spatial multiplexing technique, the data is split into multiple streams, which are transmitted and received by multiple antennas. Subsequently, the receiver detects the transmitted symbols from the signals received by the multiple receiving antennas. This kind of MIMO technique can increase channel capacity and data rate by using more transmitting antennas [4]-[6]. In addition, the diversity coding technique has a better capability of resisting the channel impairment. The most popular diversity coding technique is space-time coding (STC), which involves space diversity, modulation, and error correction [7]. The STC can moderately improve the spectral efficiency and provide coding gains for error correction [8]. Alamouti [9] proposed a two-transmit-antenna scheme, which was then extended to space-time block codes (STBC) [7] and has been adopted in
Manuscript received January 19, 2009; revised April 30, 2009. First published December 22, 2009; current version published April 09, 2010. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 96-2219-E-007-013 and Grant NSC 96-2220-E-007-014. This paper was recommended by Associate Editor A. Strollo.
K.-T. Shr is with the Department of Electrical Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).
H.-D. Chen was with the Institute of Communications Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan. He is currently in military service in Taiwan.
Y.-H. Huang is with the Institute of Communications Engineering and Department of Electrical Engineering, National Tsing-Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSI.2009.2027648
wireless standards [1], [2], to enhance the performance of wireless communication systems. However, it only improves the diversity gain rather than the channel capacity of the system. The space-time trellis code (STTC) [10]-[12] was also proposed to improve both the diversity gain and coding gain for wireless communication systems. The STTC encoder generates redundant parity check codes which are transmitted with the original information data streams, and thereby coding gain is obtained. Therefore, the STTC technique possesses a more robust capability than the STBC technique for combating severe MIMO channel impairment. However, the STTC decoder employs the Viterbi algorithm, which requires much greater decoding complexity than the STBC decoder [13]; thus, the STTC is seldom adopted as the diversity coding technique in current wireless communication systems.
In the literature, many Viterbi decoder chips have been published for convolutional codes (CC) and trellis-coded modulation (TCM) techniques. A 32-state radix-4 Viterbi decoder was first proposed and implemented in [14], and many succeeding works have proposed methods of improving the performance of the CC Viterbi decoder. Some approaches enhance the throughput performance [15] or reduce the power consumption using specific techniques, such as look-ahead calculation [16] or improved register-exchange architectures [17]. Several CC Viterbi decoders aim to support a larger number of states [18], [19] or different decoding algorithms [20] in certain communication standards. In [21], an adaptive Viterbi decoder can be reconfigured dynamically in response to varying channel conditions so as to reduce power consumption. On the other hand, Viterbi decoders for the TCM technique, which involves error-correcting coding and modulation, have been designed and implemented in recent years. The TCM Viterbi decoder requires a greater decoding complexity than the CC Viterbi decoder; thus, more effort must be made to reduce hardware costs [22], [23]. An ASIC chip [24] has been proposed to support different communication specifications under a variety of channel conditions. Moreover, TCM Viterbi decoders for WiFi and DVB have been proposed in [25] and [26], respectively.
Compared to the traditional Viterbi decoder for convolutional codes and TCM codes, the STTC Viterbi decoder has to perform a great number of complex-value multiplications for branch-metric computation, which is proportional to the modulation order and the number of antennas. Efficient STTC decoders with several antennas for the MIMO/MISO systems were proposed in [27], [28]; however, the STTC decoder architecture is seldom addressed or implemented.
To overcome this implementation bottleneck, an efficient
tree-search algorithm and a constant multiplier architecture
1549-8328/$26.00 2010 IEEE
Authorized licensed use limited to: Reva Institute of Tehnology and Management. Downloaded on June 22,2010 at 13:19:26 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Space-time trellis-coded MIMO system.
[29], [30] are proposed in order to reduce the complexity of branch metric calculation in the STTC Viterbi decoder. The complexity can be reduced to such a degree that the cost of implementation is reasonable compared with the available STBC decoder [13]. Then, the STTC Viterbi decoder is implemented based on the proposed algorithm using 0.18 μm 1P6M CMOS technology. To the authors' knowledge, this is the first STTC decoder chip to be proposed and implemented.
This paper is organized as follows. The proposed STTC decoding algorithm is introduced in Section II. The STTC decoder architecture based on the proposed algorithm is then presented in Section III. The chip implementation and measurement results are demonstrated and discussed in Section IV. Finally, a conclusion to the work is given in Section V.
II. PROPOSED STTC DECODING ALGORITHM
A. Introduction of Space-Time Trellis Codes
The space-time trellis coded MIMO system, as depicted in Fig. 1, involves modulation, error correction, and diversity techniques. $m$ bits of data $(c_t^1, c_t^2, \ldots, c_t^m)$ are encoded in the transmitter at time $t$. Then, the STTC-encoded symbols $(x_t^1, x_t^2, \ldots, x_t^{n_T})$ are transmitted by $n_T$ transmitting antennas. In the receiver, $n_R$ receiving antennas acquire the signal $(r_t^1, r_t^2, \ldots, r_t^{n_R})$, which is then decoded by the STTC decoder.
1) STTC Encoder: For $M$-PSK or $M$-QAM modulation, the STTC encoder receives $m = \log_2 M$ input bit sequences, as shown in Fig. 2. Each bit sequence is buffered with $v_k$ delay elements and encoded with a set of generator coefficients $g^k_{j,i}$ similar to the convolutional code for the $i$-th transmitting antenna, where $j = 0, 1, \ldots, v_k$ for the $k$-th bit sequence and $i = 1, 2, \ldots, n_T$. Then, the encoded symbols can be expressed by
$$x_t^i = \sum_{k=1}^{m} \sum_{j=0}^{v_k} g^k_{j,i}\, c^k_{t-j} \bmod M, \quad i = 1, 2, \ldots, n_T.$$
The generator coefficients are properly designed using the trace criterion [7] according to the channel properties and are listed in the Appendix.
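The encoder structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sttc_encode`, the data layout `g[k][j][i]`, and the coefficient set `G_4PSK` (a simple 4-state 4-PSK delay-diversity code for two transmit antennas) are assumptions chosen for the example and are not the coefficients listed in the paper's Appendix.

```python
def sttc_encode(bits, g, M):
    """Encode a bit stream with an STTC-style shift-register encoder.

    bits: list of m-bit tuples, one tuple per time step.
    g:    g[k][j][i] is the coefficient for bit stream k, delay j, antenna i
          (hypothetical layout used only in this sketch).
    M:    modulation order; encoded symbols are integers modulo M.
    """
    m = len(g)
    n_tx = len(g[0][0])
    out = []
    for t in range(len(bits)):
        symbols = []
        for i in range(n_tx):
            acc = 0
            for k in range(m):
                for j in range(len(g[k])):
                    if t - j >= 0:  # delay elements are assumed to start at zero
                        acc += g[k][j][i] * bits[t - j][k]
            symbols.append(acc % M)
        out.append(tuple(symbols))
    return out


# Illustrative coefficients only (assumed, not taken from the Appendix):
# antenna 1 sends the previous 4-PSK symbol, antenna 2 sends the current one.
G_4PSK = [
    [(0, 2), (2, 0)],  # bit stream 1: delay 0, delay 1
    [(0, 1), (1, 0)],  # bit stream 2: delay 0, delay 1
]
```

With these assumed coefficients, each output pair is (previous symbol, current symbol), which makes the role of the delay elements easy to see.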
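Fig. 2. The STTC encoder.
2) MIMO Channel Model: In this paper, we assume that the encoded symbols are impaired by the Rayleigh fading channel, and the received signal at time $t$ is modeled by
$$\mathbf{r}_t = \mathbf{H}_t \mathbf{x}_t + \mathbf{n}_t$$
where $\mathbf{x}_t$ is the vector of the $n_T$ transmitted symbols, $\mathbf{H}_t$ is the $n_R \times n_T$ channel gain matrix, and $\mathbf{n}_t$ is the noise vector consisting of all white Gaussian variables. Each element in $\mathbf{H}_t$ is an independent and identically distributed complex Gaussian random variable with zero mean and unit variance.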
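As a rough illustration of this channel model, the following sketch draws a Rayleigh channel matrix and forms the received vector as the faded symbols plus Gaussian noise. The function names, the `noise_std` parameter, and the use of Python's `random.Random` generator are assumptions made for the example.

```python
import random


def rayleigh_channel(n_rx, n_tx, rng):
    # Each entry is complex Gaussian with zero mean and unit variance,
    # i.e. variance 0.5 in each of the real and imaginary dimensions.
    s = 0.5 ** 0.5
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(n_tx)]
            for _ in range(n_rx)]


def receive(H, x, noise_std, rng):
    # r = H x + n, where noise_std is the per-dimension noise standard deviation.
    r = []
    for row in H:
        clean = sum(h * xi for h, xi in zip(row, x))
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        r.append(clean + noise)
    return r
```

Setting `noise_std` to zero returns the faded symbols alone, which is convenient for checking a decoder against a noiseless reference.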
3) STTC Decoder: The STTC decoder uses the Viterbi algorithm to detect the coded multiple streams that are transmitted through the MIMO channel. Note that the STTC trellis pertains to the modulation order in the state diagram, and there are $M$ transition paths from the previous states for $M$-PSK or $M$-QAM. Therefore, the modulation order has a great impact on the computational complexity of the branch metric calculation and the add-compare-select (ACS) computation.
Branch Metric Calculation: In the STTC Viterbi decoder, the branch metric is defined as the squared Euclidean distance between the actually received symbol and the faded candidate symbol corresponding to each trellis transition as follows,
$$\sum_{j=1}^{n_R} \left| r_t^j - \sum_{i=1}^{n_T} h_{j,i}^t\, x_t^i \right|^2 \tag{1}$$
where $r_t^j$ is the received symbol in the $j$-th receiving antenna at time $t$; $h_{j,i}^t$ is the channel response between the $j$-th receiving antenna and the $i$-th transmitting antenna at time $t$; $x_t^i$ is the candidate symbol from the $i$-th transmitting antenna at time $t$; and $n_T$ and $n_R$ are the numbers of transmitting and receiving antennas. The branch metric for each trellis path is accumulated and compared to obtain the optimal candidate symbol. It can be seen that the computational complexity of the branch metric calculation
increases along with the number of antennas and the modulation order.
Add-Compare-Select: The add-compare-select block compares the accumulated path metrics and determines the optimal path, which is recorded and updated in the memory. The STTC Viterbi decoder requires $M$-input ACS operations for $M$-PSK or $M$-QAM modulation. The expanding modulation order greatly increases the ACS hardware costs and computation delay time and thus limits the throughput in the Viterbi decoder.
In summary, the number of transmitting and receiving antennas and the modulation order determine the computational complexity of the STTC decoder. In the following, branch metric calculation methods and a new ACS architecture for reducing the computational complexity are proposed.
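To make the two cost drivers concrete, the sketch below computes a branch metric as a squared Euclidean distance over all receive antennas and performs one add-compare-select step over the incoming branches of a single state. The function names and argument layout are assumptions for illustration; the sketch uses full-precision comparison, not the approximate comparator proposed later in the paper.

```python
def branch_metric(r, H, x):
    # Squared Euclidean distance between the received vector r and the
    # faded candidate H x, summed over all receive antennas.
    total = 0.0
    for j, row in enumerate(H):
        faded = sum(h * xi for h, xi in zip(row, x))
        total += abs(r[j] - faded) ** 2
    return total


def acs(path_metrics, branch_metrics, predecessors):
    # For one destination state: add each branch metric to the path metric
    # of its predecessor state, compare the sums, and select the smallest.
    candidates = [path_metrics[s] + bm
                  for s, bm in zip(predecessors, branch_metrics)]
    best = min(range(len(candidates)), key=candidates.__getitem__)
    return candidates[best], predecessors[best]
```

The inner sum in `branch_metric` grows with the number of transmit antennas, and the length of `candidates` in `acs` grows with the modulation order, which mirrors the two dependencies named in the summary above.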
B. Proposed Branch Metric Calculation Methods
Two methods for branch metric calculation are proposed. Method I performs the branch metric calculation when the channel matrix is updated for each signal vector in the fast fading channel, and Method II computes the branch metrics if the channel remains fixed during a frame period in the slow fading channel.
1) Method I: Method I aims to simplify the computation of the branch metrics because a great amount of the complexity lies in the complex-value multiplication $h_{j,i}^t x_t^i$. Method I separates the complex-value multiplication into two real-value multiplication equations, $\mathrm{Re}\{hx\} = \mathrm{Re}\{h\}\mathrm{Re}\{x\} - \mathrm{Im}\{h\}\mathrm{Im}\{x\}$ and $\mathrm{Im}\{hx\} = \mathrm{Re}\{h\}\mathrm{Im}\{x\} + \mathrm{Im}\{h\}\mathrm{Re}\{x\}$, where $h = h_{j,i}^t$ and $x = x_t^i$. Because the values of $\mathrm{Re}\{x\}$ and $\mathrm{Im}\{x\}$ are fixed for a specific candidate symbol, we can easily calculate the product with simple shift-and-add operations. For example, in the 4-PSK STTC code in the Appendix, $x$ is $1$, $j$, $-1$, or $-j$ for two transmitters. Its trellis diagram and the I/Q mapping of the faded symbols are depicted in Fig. 3(a) and (b). The four faded candidate symbols can then be simplified as listed in Table I. Thus, the complex-value computation is reduced to real-value additions of the real and imaginary parts of the channel responses. Similarly, this method can be applied to 8-PSK and 16-QAM STTC codes, and then only constant multiplication is required in order to compute the faded symbols.
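For the 4-PSK case, the idea behind Method I can be illustrated without any multiplier at all: multiplying a channel coefficient by 1, j, -1, or -j only swaps and negates its real and imaginary parts. The sketch below is an assumed software analogue of that simplification; the function names and the index convention `q` (0, 1, 2, 3 for 1, j, -1, -j) are hypothetical.

```python
def times_qpsk(h_re, h_im, q):
    # Multiply h = h_re + j*h_im by the 4-PSK symbol j**q using only
    # swaps and sign changes (no multiplication).
    if q == 0:
        return (h_re, h_im)      # h * 1
    if q == 1:
        return (-h_im, h_re)     # h * j
    if q == 2:
        return (-h_re, -h_im)    # h * (-1)
    return (h_im, -h_re)         # h * (-j)


def faded_symbol(h1, h2, q1, q2):
    # Faded candidate for two transmit antennas and one receive antenna:
    # h1*x1 + h2*x2, computed with real additions only.
    a = times_qpsk(h1[0], h1[1], q1)
    b = times_qpsk(h2[0], h2[1], q2)
    return (a[0] + b[0], a[1] + b[1])
```

For 8-PSK and 16-QAM the symbol components are no longer restricted to 0 and ±1, so a hardware version would need constant (shift-and-add) multipliers rather than pure swaps.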
2) Method II: In the slow Rayleigh fading channel, the value of the channel response is constant for a specific time interval; therefore, the value of each faded candidate symbol remains fixed in the time interval, which is defined as a frame in the generator coefficient determination [7].
At the beginning of each frame, the received symbol and the channel-impaired candidates are mapped onto the I/Q plane. Then, the candidate node nearest to the received symbol is determined in the binary-tree structure, as shown in Fig. 3(c). The perpendicular bisector used to detect the winner of each binary comparison between each pair of nodes can be expressed as follows:
(2)
Fig. 3. (a) Trellis diagram. (b) I/Q mapping of the candidate symbols for the 4-PSK modulation. (c) Binary-tree comparison.
TABLE I. COORDINATES OF NODES FOR 4-PSK
After the winner is determined, Method II replaces the value of the received symbol in (1) with the value of the nearest candidate; that is, the distance between the nearest candidate and each other candidate is taken as the actual branch metric between the received symbol and that candidate. Once the distances between any two nodes and the bisector equations are precomputed in the first symbol of each frame, only the distance between the nearest candidate and the received symbol needs to be computed for the remaining received symbols in the frame, and thus the computational complexity can be reduced.
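The binary comparison against a perpendicular bisector can be written out in general form. The notation here is assumed for illustration and may differ from the paper's own: $P_a = (x_a, y_a)$ and $P_b = (x_b, y_b)$ are two faded candidate nodes and $R = (x_r, y_r)$ is the received symbol on the I/Q plane.

```latex
% Points equidistant from P_a and P_b (the perpendicular bisector):
\begin{align}
(x - x_a)^2 + (y - y_a)^2 &= (x - x_b)^2 + (y - y_b)^2 \\
\Longleftrightarrow\quad
2(x_b - x_a)\,x + 2(y_b - y_a)\,y &= x_b^2 + y_b^2 - x_a^2 - y_a^2 .
\end{align}
% Slope-intercept form, valid when y_a \neq y_b:
\begin{equation}
y = -\frac{x_b - x_a}{y_b - y_a}\,x
    + \frac{x_b^2 + y_b^2 - x_a^2 - y_a^2}{2\,(y_b - y_a)} .
\end{equation}
% Decision rule: R is closer to P_a than to P_b if and only if
\begin{equation}
2(x_b - x_a)\,x_r + 2(y_b - y_a)\,y_r < x_b^2 + y_b^2 - x_a^2 - y_a^2 .
\end{equation}
```

Because the coefficients depend only on the candidate nodes, they can be computed once per frame, and each later comparison costs two multiplications, one addition, and one comparison.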
Method II is used to compute the branch metrics for an example of the 4-state 4-PSK STTC with 2Tx and 1Rx, whose trellis transition is shown in Fig. 3(a). In the original branch metric calculation, the four branch metrics are calculated using the following equations:
(3)
In Method II, the received symbol and the candidate symbols are mapped onto the I/Q plane as nodes, as shown in Fig. 3(b). The faded candidate symbols are computed in the same way as in Method I. To determine the faded symbol node nearest to the received symbol, three perpendicular bisectors are derived as follows:
(4)
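Then, the faded node nearest to the received symbol can be obtained using the tree-search comparison, and the branch metrics of the three other nodes are replaced with the precomputed distances between each of these nodes and the nearest node, as follows.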
(5)
The coordinates of the faded candidate symbols, the distances between any two nodes, and all perpendicular bisectors are precomputed at the beginning of a frame and saved in memory until the end of the frame. Therefore, these precomputed distances can be used without the need for additional branch metric calculations for all other loser nodes in the tree-search comparison. Consequently, if the frame is large, a great amount of complexity can be saved.
The idea of precomputing slow-changing or common information is popular in digital signal processing algorithms for communication. Typical examples are channel estimation for OFDM systems and preprocessing for the MIMO detector, such as QR decomposition and lattice reduction. These operations are usually performed block-wise, i.e., all computations in these functions are performed once and their results are afterward utilized for a frame. In the proposed Method II, the branch metric calculation block itself is not slow-changing or common during a frame because it must compute branch metrics for successive input signals. Method II simply uses the nearest node (fixed for a frame) to replace the received signal (variant within a frame) so that the branch metric computation can be approximated by using the fixed node distances, which are precomputed in the initial symbols of a frame. In the implementation of the algorithm and the hardware architecture, the branch metric calculation block partially generates the results with the precomputed constant node distances and partially computes the branch metric of the winner node for the successive input signals during a frame. This is the main difference from other similar design concepts in digital signal processing for communication.
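The flow of Method II, as described above, can be sketched as follows. This is a simplified illustration under stated assumptions: nodes are complex numbers, the pairwise winner is decided by comparing squared distances directly (which gives the same decision as a perpendicular-bisector test), and all function names are hypothetical.

```python
def dist2(p, q):
    # Squared Euclidean distance between two points given as complex numbers.
    return abs(p - q) ** 2


def precompute(nodes):
    # Pairwise squared distances among the faded candidates; in a slow
    # fading channel this table is built once per frame.
    return [[dist2(a, b) for b in nodes] for a in nodes]


def nearest_by_tree(r, nodes):
    # Binary-tree search: nodes are compared in pairs and the winner of
    # each comparison advances until a single node remains.
    alive = list(range(len(nodes)))
    while len(alive) > 1:
        nxt = []
        for a, b in zip(alive[0::2], alive[1::2]):
            nxt.append(a if dist2(r, nodes[a]) <= dist2(r, nodes[b]) else b)
        if len(alive) % 2:
            nxt.append(alive[-1])  # odd node out advances unchallenged
        alive = nxt
    return alive[0]


def method2_metrics(r, nodes, table):
    # Exact metric for the winner; every loser node gets the precomputed
    # node-to-winner distance instead of its true distance to r.
    w = nearest_by_tree(r, nodes)
    return [dist2(r, nodes[w]) if i == w else table[w][i]
            for i in range(len(nodes))]
```

Only one true distance is computed per received symbol after the table exists, which is where the saving comes from; the loser metrics are approximations, consistent with the coding-gain loss reported for Method II.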
C. Performance Analysis
The STTC Viterbi decoder using the proposed branch metric calculation methods was simulated and analyzed. The frame size is assumed to be 130 received symbols for slow fading STTC codes. The frame error rate (FER) performances for different modulation schemes are shown in Fig. 4. It can be seen that Method I does not degrade performance because only the simplification of the complex-value multiplication is applied. Moreover, if the Rayleigh fading channels are slow, Method II can further reduce the computational complexity at the cost of a degradation in coding gain.
D. Complexity Analysis
The computational complexities of the proposed branch metric calculation methods are analyzed for the different MIMO configurations. It is assumed that the channel response remains fixed in a frame, which contains 130 symbols in the slow fading channel and only one symbol in the fast fading channel. Fig. 5 depicts the computational complexities for 16-QAM under different MIMO configurations. Note that Method I is equivalent to the traditional Viterbi algorithm; thus, the total number of operations is not reduced. However, the constant portion represents the simplified multiplications and occupies almost 50% of the total number of computations. Because the constant multiplications can be implemented using shift-and-add operations, cost and power can be greatly reduced in the VLSI implementation. If the fading channel is slow, both methods can eliminate the majority of constant operations because the faded candidate symbols are computed only once in a single frame. Furthermore, more operations can be saved in Method II because the computations of the branch metrics for the loser nodes are replaced by accessing the precomputed distances in memory. It can be seen that the computational complexities can be reduced by 70%, 78%, and 83% for 2×2, 3×3, and 4×4 MIMO configurations. Note that the computational complexities can be further reduced by more than 90% in the 8×8 MIMO using Method II in the slow fading channel.
The computational complexities of the proposed methods have a close relationship with the modulation order and the frame size, as shown in Fig. 6. For Method I, the metric values are computed for each received symbol. Thus, the computational complexity tends to converge as the frame size increases. For Method II, the precomputation of the distances between any two nodes and the bisector equations requires a large amount of computation in the initially received symbols in a frame. However, the computational complexities are averaged and become lower as the frame size increases because the later distance results are accessed from the memory without requiring any computation. Therefore, more computation cost can be saved. These analysis results provide the information for determining which method is employed under different
Fig. 4. Frame error rate for (a) 4-PSK, (b) 8-PSK, and (c) 16-QAM modulations.
circumstances. The results reveal that the frame-size threshold increases along with the modulation order, while the threshold remains fixed for a specific modulation order under different MIMO configurations. Therefore, the proper branch metric calculation method (Method I or II) can be employed for different frame sizes and modulation orders.
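The frame-size threshold mentioned above can be understood with a simple amortization model. The symbols below are hypothetical and not taken from the paper: $C_{\mathrm{I}}$ is the per-symbol cost of Method I, $C_{\mathrm{pre}}$ is the one-time precomputation cost of Method II in a frame, $C_{\mathrm{w}}$ is the per-symbol cost of Method II once the tables exist, and $F$ is the frame size in symbols.

```latex
% Average per-symbol cost of Method II over a frame of F symbols:
\begin{equation}
\bar{C}_{\mathrm{II}}(F) = C_{\mathrm{w}} + \frac{C_{\mathrm{pre}}}{F} .
\end{equation}
% Method II is cheaper than Method I when its average cost is lower,
% i.e. (assuming C_I > C_w) when the frame exceeds a threshold:
\begin{equation}
\bar{C}_{\mathrm{II}}(F) < C_{\mathrm{I}}
\quad\Longleftrightarrow\quad
F > \frac{C_{\mathrm{pre}}}{C_{\mathrm{I}} - C_{\mathrm{w}}} .
\end{equation}
```

Under this simplified model, a higher modulation order raises $C_{\mathrm{pre}}$ (more node pairs and bisectors), which is consistent with the observation that the threshold grows with the modulation order.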
Fig. 5. Computational complexity of 16-QAM modulations.
Fig. 6. Complexity versus frame size for a 4×4 MIMO.
Fig. 7. Block diagram for the STTC decoder.
Fig. 8. Branch metric generator architecture.
III. ARCHITECTURE AND CIRCUIT DESIGN
The STTC Viterbi decoder, like the convolutional code Viterbi decoder, consists of three basic blocks: the branch metric generator, the add-compare selector, and the path&metric updater for final decoding, as shown in Fig. 7. The decoder supports 4-PSK/8-PSK/16-QAM modulations with 4/8/16, 8/16, and 16 states, respectively, and provides two operation modes,
Fig. 9. Faded symbol generator.
Mode I and Mode II, for the respective methods, Method I and
Method II, in the proposed branch metric calculation.
A. Branch Metric Generator
The branch metric generator (BMG), as shown in Fig. 8, is composed of the faded symbol generators (FSGs), precomputation and decision blocks (PDBs), and branch metric calculators (BMCs). In Mode I, the received symbol and the faded symbols are used to calculate the branch metrics. In Mode II, the PDBs compute the bisector equations between any two nodes of the faded candidate symbols and detect the winner node for the branch metric calculation in Method II. Then, the BMCs use the faded symbols to compute the branch metrics.
1) Faded Symbol Generators: Four faded symbol generators are used in the BMG, and each FSG computes the faded candidate symbols using shift-and-add operations, as shown in Fig. 9. Then, the faded symbols are stored in memory. Because, at most, sixteen states are used in the STTC specification, sixteen sets of real and imaginary parts are required for one FSG. In Mode I, these faded symbols are directly delivered to the branch metric calculator, while, in Mode II, they are used to precompute the distances and the bisectors. Six bits are used for the received signal and channel response; thus, the total
memory size in the four FSGs is 4 16 . Because
four candidate symbols are computed or read by the four FSGs, the faded symbol computation can be carried out in one cycle for the 4-PSK modulation, two cycles for the 8-PSK modulation, and four cycles for the 16-QAM modulation.
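The cycle counts quoted above follow from dividing the number of candidate symbols by the number of FSGs working in parallel. The helper below is an assumed restatement of that arithmetic, with the number of candidates taken to equal the modulation order; it is not a description of the chip's control logic.

```python
def fsg_cycles(num_candidates, num_fsgs=4):
    # Ceiling division: each cycle, num_fsgs candidates are computed or read.
    return -(-num_candidates // num_fsgs)
```

With four FSGs, 4, 8, and 16 candidates take 1, 2, and 4 cycles, matching the figures given for 4-PSK, 8-PSK, and 16-QAM.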
2) Pre-Computation and Decision Block: In the initial symbols of a frame, the PDBs compute the necessary bisector equations for binary-tree comparison and then store these equations in memory. If the required bisector equations have been stored in memory, the results are simply read without requiring any computation. Therefore, the PDBs do not compute any bisector equation once all of the bisectors have been stored. The slope, intercept, and scaled constant of each perpendicular bisector can be computed as follows:
(6)
Fig. 10. Pre-computation and decision block.
and stored in the PDB memory if the bisector has never been calculated. Two multipliers are shared among the computations of the bisector parameters, and thus the PDB requires one additional cycle at the beginning of the frame, as shown in Fig. 10. However, a parameter that has already been computed and stored can be read directly from memory, so that only the remaining parameters have to be computed. The PDB needs only one cycle once all the bisectors are computed and stored. Six PDBs are utilized in the BMG, and each
PDB contains a memory block with 16 for,at most, sixteen states. Therefore, six bisectors can be generated
for binary-tree comparison in one cycle.
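The compute-once, read-afterward behavior of the PDB can be mimicked in software with a small cache. The sketch below is an assumption-laden illustration: it stores each bisector in the implicit form a·x + b·y = c rather than the slope, intercept, and scaled constant used in the paper, and the class and function names are hypothetical.

```python
def bisector(p, q):
    # Coefficients (a, b, c) of the perpendicular bisector of points p and q,
    # written as a*x + b*y = c to avoid a division by (yq - yp).
    (xp, yp), (xq, yq) = p, q
    a = 2 * (xq - xp)
    b = 2 * (yq - yp)
    c = xq * xq + yq * yq - xp * xp - yp * yp
    return a, b, c


def closer_to_first(r, line):
    # True if point r lies on the side of the bisector nearer to p.
    a, b, c = line
    return a * r[0] + b * r[1] < c


class BisectorStore:
    # Computes a bisector the first time a node pair is requested and
    # returns the stored result on every later request within the frame.
    def __init__(self):
        self.cache = {}
        self.computed = 0

    def get(self, i, j, nodes):
        key = (i, j)
        if key not in self.cache:
            self.cache[key] = bisector(nodes[i], nodes[j])
            self.computed += 1
        return self.cache[key]
```

Clearing `cache` at a frame boundary would correspond to recomputing the bisectors when the channel changes.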
3) Branch Metric Calculator: In Mode I, four BMCs directly compute the Euclidean distances, as shown in Fig. 11. In Mode II, the BMC computes the branch metric and stores it in memory only when the required branch metric has not been previously computed. Once all the branch metrics have been computed and stored in memory, the branch metrics are simply read from memory, and more computation power can be saved accordingly.
4) Computation Timing Schedule: The BMG is composed of four FSGs, six PDBs, and four BMCs. Fig. 12 shows the Mode II BMG timing schedules for the 4-PSK and 8-PSK modulations. For the 4-PSK modulation, the BMG generates the metrics for one received symbol in four cycles at the beginning of a frame, and in three cycles for the following symbols. Note that since six PDBs compute all the bisector equations in the first symbol, the remaining symbols in the frame require three cycles for a single symbol. On the other hand, the PDBs only compute the bisectors necessary for binary-tree comparison in one received symbol and store them in memory. Thus, four cycles are required at the beginning of a frame for a single symbol. Once all the bisectors are saved in memory, three cycles are required for the remaining symbols. For the 8-PSK modulation, as depicted in Fig. 12(b), the initial received symbol requires nine cycles to complete the branch metric calculation, while the remaining symbols require seven cycles. In order to process the symbols for the 16-QAM modulation, seventeen cycles are required at the beginning, while thirteen cycles are required for the remaining symbols.
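The cycle counts above can be averaged over a frame to see how little the start-of-frame overhead matters for long frames. The sketch assumes, as a simplification, that only the first symbol of a frame pays the longer latency; the table name and function name are hypothetical.

```python
# (cycles for the first symbol, cycles for each remaining symbol) in Mode II,
# as quoted in the text for each modulation.
BMG_CYCLES = {"4-PSK": (4, 3), "8-PSK": (9, 7), "16-QAM": (17, 13)}


def average_bmg_cycles(modulation, frame_size):
    # Average BMG cycles per symbol, assuming only the first symbol of the
    # frame incurs the longer start-of-frame latency.
    first, rest = BMG_CYCLES[modulation]
    return (first + (frame_size - 1) * rest) / frame_size
```

For a 130-symbol frame, the 4-PSK average is 391/130, or roughly 3.01 cycles per symbol, so the extra start-of-frame cycle is almost fully amortized.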
B. Add-Compare Selector
In the ACS block, branch metrics are accumulated in the path metrics for determining the decoding path in the trellis diagram, as shown in Fig. 13. Because the 4-input binary comparator requires a longer computation time and a higher cost, a simple 4-input minimum comparator architecture is proposed,
Fig. 11. The branch metric calculator.
Fig. 12. BMG timing schedules for (a) 4-PSK and (b) 8-PSK.
Fig. 13. ACS block.
as shown in Fig. 14. The n-bit comparison unit in this comparator encodes the branch metrics using an array of bitwise ORs, as shown in Fig. 15. The ones counter then counts the
number of ones in the sequence. It is assumed that there are
-bit ones, and is less than eight. The th-bit of each input
sequence is chosen as the output sequence. On the other hand,if is equal to eight, the sixth bit of each metric is chosen
as the output sequence. The 4-bit sequence is compared with
Table II to determine the minimum metric. To demonstrate the advantages of the 4-input minimum comparator, the proposed and tree-based comparators were designed and simulated using 0.18 μm CMOS technology. The proposed comparator occupies 1806.23 μm² with a critical path delay of 4.75 ns, while the tree-based comparator occupies 2837.42 μm² with a critical path delay of 8.79 ns. The proposed minimum comparator may cause errors in some cases; however, these errors have a negligible effect on the decoder performance, as shown in the following fixed-point simulation results. Moreover, a simple but reliable normalization architecture is proposed in order to prevent data overflow, as shown in Fig. 16. When all the path metrics are larger than 11'b00001000000, this constant is subtracted from each branch metric. Fig. 17 shows the ACS operation schedules for the 4-PSK and 8-PSK modulations. Comparison for the 4-PSK modulation can be performed in one cycle using the 4-input minimum comparator. However, an additional 2-input comparison is required for the 8-PSK modulation in an additional cycle, as shown in Fig. 17. Thus, three clock cycles are required in order to determine the minimum branch metric for each state in the 8-PSK modulation, and five cycles are required to perform five 4-input comparisons for the 16-QAM modulation.
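The normalization step can be illustrated with a short sketch. It assumes that the constant is removed from the stored path metrics (subtracting it from the incoming metrics before accumulation would have the same effect on the comparisons), and the names `THRESHOLD` and `normalize` are hypothetical.

```python
THRESHOLD = 0b00001000000  # 11-bit constant from the text, equal to 64


def normalize(path_metrics, threshold=THRESHOLD):
    # When every path metric exceeds the threshold, subtract it from all of
    # them. The differences between metrics, and therefore all later ACS
    # decisions, are unchanged, while the stored values stay bounded.
    if all(pm > threshold for pm in path_metrics):
        return [pm - threshold for pm in path_metrics]
    return list(path_metrics)
```

Because the same constant is removed from every metric, the ordering of the survivors is preserved, which is why such a scheme can prevent overflow without affecting the decoded path.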
C. Path&Metric Updater
After the minimum metric and its path indexes are determined, they are stored in memory, and the final results are decoded using the trace-back algorithm [31], [32]. The path&metric updater is shown in Fig. 18. Eight 4-to-1 multiplexers perform the trace-back for the 4-state trellis, while four 8-to-1 multiplexers and two 16-to-1 multiplexers are used for the 8-state and 16-state trellises, respectively. In the proposed design, the fixed-state trace-back algorithm, in which state 0 is regarded as the starting point for tracing back in the decoding window, is chosen. Thus, the hardware costs can be reduced significantly, while the performance degradation between the fixed-state and soft-state trace-back algorithms is negligible.
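A fixed-state trace-back can be sketched as a backward walk through stored survivor decisions that always begins at state 0. The survivor layout used here, a list of per-time-step tables mapping each state to a (previous state, decoded symbol) pair, is an assumption for illustration rather than the chip's memory organization.

```python
def trace_back(survivors, start_state=0):
    # survivors[t][s] = (previous_state, decoded_symbol) for state s at time t.
    # Walk backward from start_state at the end of the decoding window.
    state = start_state
    decoded = []
    for column in reversed(survivors):
        prev_state, symbol = column[state]
        decoded.append(symbol)
        state = prev_state
    decoded.reverse()  # symbols were collected from last to first
    return decoded
```

Starting from a fixed state avoids searching for the state with the best path metric at the end of the window, which is the hardware saving the text refers to.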
Fig. 14. 4-input comparator.
Fig. 15. n -bit comparison unit.
TABLE II. LOOK-UP THE MINIMUM DECISION BLOCK
D. Fixed-Point Simulation Results
In this work, efforts were dedicated to reducing the complexity of branch metric calculation, which implies that the performance is sensitive to the precision of the branch metrics. Therefore, fixed-point simulations were performed before the chip implementation in order to ensure that the degradation is slight.
Fig. 16. Normalization block.
Fig. 17. ACS block timing schedules for 4-PSK and 8-PSK.
Fig. 19 shows that the performance of the fixed-point Viterbi
decoder using a fixed-state trace-back algorithm approximates
that of the floating-point Viterbi decoder using a soft-state
trace-back algorithm. The word-length of each signal was
determined through fixed-point simulations according to the
frame error rate (FER) metric. The decoding window size was
also determined through fixed-point simulations, as shown in
Fig. 20. The results show that the minimum truncation window
size for 4-PSK is about the constraint length, and
the window size for 8-PSK and 16-QAM is the
constraint length. The decrease in window size with increasing modulation order is mainly due to the fact that the higher diversity in the STTC trellis causes a faster convergence in the
Fig. 18. Path&metric updater for the m -bit modulation.
Fig. 19. Fixed-point simulation results for the proposed STTC decoder.
Fig. 20. Simulation of different truncation window sizes for the proposed STTC decoder.
decoding path. Therefore, 4 × 16 × 32 bits of memory are used to store 4 branch paths, 16 states, and 33 truncation window sizes.
Fig. 21. Photo of the STTC decoder chip.
TABLE III. CHIP SPECIFICATION FOR THE STTC DECODER
TABLE IV. POWER CONSUMPTIONS FOR DIFFERENT MODULATIONS AND STATE SIZES UNDER MODE I AND MODE II
IV. CHIP IMPLEMENTATION
The proposed STTC decoder was fabricated using TSMC 0.18 μm one-poly six-metal CMOS technology (see Fig. 21). The chip occupies about 4.35 mm² with a core area of 1.62 mm² and includes a range of memory capacities as listed in Table III.
TABLE V. COMPARISON TABLE
The STTC decoder chip supports 4-PSK, 8-PSK, and 16-QAM modulations and 2×1, 3×1, and 4×1 antenna configurations. A total of 128 frames of test patterns are generated for each modulation type, and each frame comprises 130 data symbols impaired by the fading channel and AWGN defined in Section II-A2. These patterns are then used to test the STTC decoder chip in Modes I and II via the Agilent 93000 digital test station. The detailed measurement results are listed in Table III.
The clock rate of the chip reaches 22.28 MHz. The
throughput ranges from 2.785 Mbps to 11.14 Mbps for the
4-PSK modulation and from 4.18 Mbps to 8.36 Mbps for the 8-PSK
modulation, and it is 5.57 Mbps for the 16-QAM modulation. The
power consumption ranges from 0.43 mW to 2.45 mW for
different numbers of states and modulation orders, as listed in
Table IV. The power consumption increases along with the modulation order and the number of states. When
the modulation order is higher than 8-PSK, or the number of
states is larger than sixteen, the chip reaches the upper power
consumption limits of about 2.4 mW for Mode I and about
2.0 mW for Mode II because the branch metric generator and
path&metric updater were designed for the 8-PSK and 16-state
scheme. The STTC Viterbi chip was designed only for
this scheme because the numbers of FSG, PDB, and BMC blocks
grow rapidly for 16-QAM
modulation; nevertheless, the chip can still perform 16-QAM STTC
decoding at the cost of increased processing latency.
Note that Mode II (for Method II) reduces the power consumption of Mode I (for Method I) by , and the
cost overhead for Mode II lies in the PDBs, which account for
only 9.38% of the total cell area. The large power reduction
arises because, in Mode II for the slow fading channel,
the branch metrics are computed using the PDBs and stored
in memory. Once most of the branch metrics are prestored in
memory, the high branch metric computation power is replaced
by the lower memory access power.
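The idea of trading computation for memory access in a slow fading channel can be sketched in software as follows. This is only an illustration under stated assumptions, not the chip's PDB architecture: it assumes the standard MISO STTC branch metric |r - Σ h_i s_i|², a channel that stays constant over a frame, and hypothetical helper names (`psk_constellation`, `precompute_products`, `branch_metric`).

```python
import cmath


def psk_constellation(order):
    """Unit-energy M-PSK points (illustrative constellation)."""
    return [cmath.exp(2j * cmath.pi * k / order) for k in range(order)]


def precompute_products(channel, constellation):
    """Once per frame: table[i][k] = h_i * s_k for every antenna i and symbol k.

    Assumes the channel gains h_i do not change within the frame.
    """
    return [[h * s for s in constellation] for h in channel]


def branch_metric(received, tables, symbol_indices):
    """Squared Euclidean distance using table lookups instead of multiplications."""
    expected = sum(tables[i][k] for i, k in enumerate(symbol_indices))
    return abs(received - expected) ** 2


# Two transmit antennas, one receive antenna, 4-PSK (example values only).
constellation = psk_constellation(4)
channel = [1 + 0j, 0.5j]
tables = precompute_products(channel, constellation)

# Noise-free received sample for the symbol pair (1, 2).
received = channel[0] * constellation[1] + channel[1] * constellation[2]
print(branch_metric(received, tables, (1, 2)))  # close to 0 for the true branch
print(branch_metric(received, tables, (0, 3)))  # larger for a wrong branch
```

In this sketch the complex multiplications are performed once when the tables are built, and every later trellis stage in the frame needs only lookups, additions, and one squared magnitude per branch.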
Although no STTC Viterbi decoder has previously been reported
in the literature, the chip is compared with other relatively
state-of-the-art designs, such as the CC Viterbi decoders
[15], [18], [19] and the TCM Viterbi decoders [22]-[24], as listed
in Table V. Because Viterbi decoder performance is closely related to the computational complexity of the trellis
diagram of the Viterbi decoder, a trellis work-load factor (TWLF)
is defined to indicate the computational complexity per trellis
stage as follows:
where is the arithmetic computational cost for branch metric
computation; is the number of modulated bits; is the state
number; and is the antenna dimensions. and
both equal 1 for the CC and TCM decoders.
The Viterbi decoders in [22] and [23] both use the Hamming
distance to calculate the branch metrics; thus, their branch-metric cost is only
one XOR gate count. The branch-metric cost for the CC Viterbi decoders in
[15], [18], and [19] represents the gate count of one subtractor
and one squarer, which are used to compute the Euclidean distances with different word-lengths in a single dimension. The
TCM Viterbi decoder in [24] employs the simplified Euclidean
distance to determine the branch metrics in two dimensions,
and thus its branch-metric cost is the gate count of two subtractors without
any squarer. The branch-metric cost for the proposed STTC Viterbi decoder
corresponds to the complex-valued channel-fading multiplications
and Euclidean distance computation [see (1)]. Gate counts
for the XOR, subtractor, squarer, and complex-valued multiplier are
estimated by synthesizing these operators using a standard cell
library. The branch-metric cost values in Table V are normalized to that of the
design in [23], which is one XOR gate.
Note that, although the proposed chip has lower throughput than other decoders, the TWLF for the STTC Viterbi decoder
is much larger than those of other Viterbi decoders because
the STTC Viterbi decoder must perform much more complex
branch-metric computation, which increases along with the
antenna dimension and modulation order. To identify the
processing capabilities of Viterbi decoders for different types of
trellis codes, we normalize the chip throughput by multiplying
it by the TWLF. The normalized throughput, shown in
brackets in Table V, can be regarded as the trellis processing
capability per second for the Viterbi decoder. If we further
consider the power consumption, we can see that the power
efficiency (throughput per mW) of the proposed STTC Viterbi
decoder outperforms most other Viterbi decoders except for [24]. Note that these comparison values still depend on the
fabrication technology. If the proposed decoder is implemented
using an advanced technology, its (normalized) throughput and
power efficiency can be further improved.
TABLE VI. GENERATOR COEFFICIENT SETS OF THE STTC ENCODER FOR THE SLOW FADING CHANNEL
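The two comparison quantities described above can be written out as a short sketch. The function names are hypothetical, and the TWLF and power values passed in below are placeholders chosen only to show the calculation; they are not entries from Table V. Only the 11.14 Mbps throughput is taken from the text.

```python
def normalized_throughput(throughput_mbps: float, twlf: float) -> float:
    """Chip throughput multiplied by the trellis work-load factor (TWLF)."""
    return throughput_mbps * twlf


def power_efficiency(throughput_mbps: float, power_mw: float) -> float:
    """Throughput per mW, as the power efficiency is described in the text."""
    return throughput_mbps / power_mw


measured_throughput_mbps = 11.14  # peak 4-PSK throughput reported in the text
placeholder_twlf = 100.0          # hypothetical value, NOT from Table V
placeholder_power_mw = 2.0        # hypothetical value, NOT from Table IV

print(normalized_throughput(measured_throughput_mbps, placeholder_twlf))
print(power_efficiency(measured_throughput_mbps, placeholder_power_mw))
```

Because the TWLF scales with the branch-metric cost, a decoder with modest raw throughput can still show a large normalized throughput when each trellis stage requires far more arithmetic.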
V. CONCLUSION
In this paper, a low-complexity STTC decoding algorithm
and its associated hardware architecture are proposed. Method
I (Mode I) is designed for fast fading channels and Method II
(Mode II) is designed for slow fading channels. The complexity
analysis indicates which
method (mode) should be employed under different configurations.
The STTC decoder is implemented using 0.18 µm 1P6M
CMOS technology and supports 4 × 1, 3 × 1, and 2 × 1 configurations
for 4/8/16-state 4-PSK, 8/16-state 8-PSK, and 16-state 16-QAM STTC schemes. Moreover, two modes are offered
for decoding the received symbols under different fading
channel conditions. This chip yields a maximum throughput of
11.14 Mbps at a power consumption of 0.43 mW. In conclusion, an STTC decoder was realized in a silicon chip, which
the authors believe can improve the reliability of future coded
MIMO communication systems.
TABLE VII. GENERATOR COEFFICIENT SETS OF THE STTC ENCODER FOR THE FAST FADING CHANNEL
APPENDIX
GENERATOR COEFFICIENT SETS
See Tables VI and VII.
ACKNOWLEDGMENT
The authors would like to thank the Chip Implementation Center
(CIC) of the National Science Council in Taiwan for technical
support. They also thank the anonymous reviewers for their valuable suggestions that greatly improved this paper.
REFERENCES
[1] Information Technology-Telecommunications and Information Exchange Between Systems-Local and Metropolitan Area Networks-Specific Requirements-Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment 4: Enhancements for Higher Throughput, IEEE Unapproved Draft Standard 802.11n, 4.00, 2008.
[2] Local and Metropolitan Area Networks Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems Amendment 2: Physical and Medium Access Control Layers for Combined Fixed and Mobile Operation in Licensed Bands and Corrigendum 1, IEEE Standard 802.16, 2006.
[3] J. Winters, J. Salz, and R. D. Gitlin, "The impact of antenna diversity on the capacity of wireless communication systems," IEEE Transactions on Communications, vol. 42, pp. 1740-1751, Feb. 1994.
[4] Y. Jung, J. Kim, S. Noh, H. Yoon, and J. Kim, "A digital 120 Mb/s MIMO-OFDM baseband processor for high speed wireless LANs," in Proc. IEEE CICC'05, Sep. 2005, pp. 81-84.
[5] T. Chen, Z. Yu, Y. Peng, Y. Zhang, H. Dai, and X. Liu, "A MIMO receiver SOC for CDMA applications," in Proc. IEEE International SOC Conference, Sep. 2006, pp. 275-278.
[6] Y. Jung, J. Kim, S. Lee, H. Yoon, and J. Kim, "Design and implementation of MIMO-OFDM baseband processor for high-speed wireless LANs," IEEE Transactions on Circuits and Systems Part II: Express Briefs, vol. 54, pp. 631-635, Jul. 2007.
[7] B. Vucetic and J. Yuan, Space-Time Coding. John Wiley and Sons, 2003.
[8] D. Bevan and R. Tanner, "Performance comparison of space-time coding techniques," IEE Electronics Letters, vol. 35, pp. 1707-1708, Sep. 1999.
[9] S. M. Alamouti, "A simple transmit diversity technique for wireless communications," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 1451-1458, Oct. 1998.
[10] V. Tarokh, N. Seshadri, and A. R. Calderbank, "Space-time codes for high data rate wireless communication: Performance criterion and code construction," IEEE Transactions on Information Theory, vol. 44, pp. 744-765, Mar. 1998.
[11] A. Naguib, V. Tarokh, N. Seshadri, and A. Calderbank, "A space-time coding modem for high-data-rate wireless communications," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 1451-1458, Oct. 1998.
[12] V. Tarokh, A. Naguib, N. Seshadri, and A. R. Calderbank, "Combined array processing and space-time coding," IEEE Transactions on Information Theory, vol. 45, pp. 1121-1128, May 1999.
[13] E. Cavus and B. Daneshrad, "A very low-complexity space-time block decoder (STBD) ASIC for wireless systems," IEEE Transactions on Circuits and Systems Part I: Regular Papers, vol. 53, pp. 60-69, Jan. 2006.
[14] P. J. Black and T. H. Meng, "A 140-Mb/s, 32-state, radix-4 Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 27, pp. 1877-1885, Dec. 1992.
[15] E. Yeo, S. A. Augsburger, W. R. Davis, and B. Nikolic, "A 500-Mb/s soft-output Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 38, pp. 1234-1241, Jul. 2003.
[16] C. Cheng and K. K. Parhi, "Hardware efficient low-latency architecture for high throughput rate Viterbi decoders," IEEE Transactions on Circuits and Systems Part II: Express Briefs, vol. 55, pp. 1254-1258, Dec. 2008.
[17] M. D. Shieh, T. P. Wang, and D. W. Yang, "Low-power register-exchange survivor memory architectures for Viterbi decoders," IET Circuits, Devices, & Systems, vol. 3, pp. 83-90, Apr. 2009.
[18] F. Sun and T. Zhang, "Parallel high-throughput limited search trellis decoder VLSI design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 1013-1022, Sep. 2005.
[19] M. A. Anders, S. K. Mathew, S. K. Hsu, R. K. Krishnamurthy, and S. Borkar, "A 1.9 Gb/s 358 mW 16-256 state reconfigurable Viterbi accelerator in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, pp. 214-222, Jan. 2008.
[20] C. C. Lin, Y. H. Shih, H. C. Chang, and C. Y. Lee, "A low power turbo/Viterbi decoder for 3GPP2 applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, pp. 426-430, Apr. 2006.
[21] R. Tessier, S. Swaminathan, R. Ramaswamy, D. Goeckel, and W. Burleson, "A reconfigurable, power-efficient adaptive Viterbi decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 484-488, Apr. 2005.
[22] E. F. Haratsch and K. Azadet, "A low complexity joint equalizer and decoder for 1000Base-T Gigabit Ethernet," in Proc. IEEE CICC'00, May 2000, pp. 465-468.
[23] A. Dinh and X. Hu, "A hardware-efficient technique to implement a trellis code modulation decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, pp. 745-750, Jun. 2005.
[24] M. Kamuf, V. Owall, and J. B. Anderson, "Optimization and implementation of a Viterbi decoder under flexibility constraints," IEEE Transactions on Circuits and Systems Part I: Fundamental Theory and Applications, vol. 55, no. 8, pp. 2411-2422, Sep. 2008.
[25] S. Nandula, Y. S. Rao, and S. P. Embanath, "High speed area efficient configurable Viterbi decoder for WiFi and WiMAX systems," in Proc. IEEE ICIAS'07, Nov. 2007, pp. 1396-1399.
[26] R. Manzoor, A. Rafique, and K. B. Bajwa, "Hardware implementation of pragmatic trellis coded modulation applied to 8PSK and 16QAM for DVB standard," in Proc. IEEE CNSDSP'08, Jul. 2008, pp. 363-367.
[27] H. Lee and M. P. Fitz, "Systematic expansion of full diversity space-time multiple TCM codes for two transmit antennas," IEEE Transactions on Wireless Communications, vol. 7, pp. 2027-2032, Jun. 2008.
[28] T. M. H. Ngo, G. Zaharia, S. Bougeard, and J. F. Helard, "Design of balanced QPSK space-time trellis codes for several transmit antennas," in Proc. IEEE ISSCS'07, Jul. 2007, vol. 2, pp. 1-4.
[29] D. Kim and H.-W. Choi, "Advanced constant multiplier for multipath pipelined FFT processor," IET Electronics Letters, vol. 44, pp. 518-519, Apr. 2008.
[30] H. T. Nguyen and A. Chatterjee, "Number-splitting with shift-and-add decomposition for power and hardware optimization in linear DSP synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 419-424, Aug. 2000.
[31] O. Collins and F. Pollara, "Memory management in traceback Viterbi decoders," Jet Propulsion Laboratory, TDA Progress Report 42-99, 1988.
[32] T. K. Truong, M. T. Shih, I. S. Reed, and E. H. Satorius, "A VLSI design for a trace-back Viterbi decoder," IEEE Transactions on Communications, vol. 40, pp. 616-624, Mar. 1992.
Kai-Ting Shr (S'08) was born in Taiwan in 1983. He received the B.S. degree in electrical engineering from National Tsing-Hua University (NTHU), Hsinchu, Taiwan, in 2005, where he is currently working toward the Ph.D. degree.
His research interests include VLSI design and implementation of low-power and high-throughput communication systems.
Hong-Du Chen was born in Taiwan in 1982. He received the B.S. degree in electronic engineering from Chang-Gung University, Taoyuan, Taiwan, in 2004 and the M.S. degree in communications engineering from National Tsing-Hua University, Hsinchu, Taiwan, in 2007.
He is currently in military service in Taiwan. His research interests include VLSI design and implementation of communication applications.
Yuan-Hao Huang (S'98-M'02) was born in Taiwan in 1973. He received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1995 and 2001, respectively.
He was a Member of Technical Staff with VXIS Technology Corporation, Hsin-Chu, Taiwan, from 2001 to 2005. Since 2005, he has been with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing-Hua University, Taiwan, where he is currently an Assistant Professor. His research interests include VLSI design for digital signal processing systems and telecommunication systems.