Energy Efficient Hardware Architecture of LU Triangularization for MIMO Receiver

632 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 57, NO. 8, AUGUST 2010

Energy Efficient Hardware Architecture ofLU Triangularization for MIMO Receiver

Ji-Woong Choi, Senior Member, IEEE, Jungwon Lee, Member, IEEE,Byung Gueon Min, and Jongsun Park, Member, IEEE

Abstract—An energy-efficient hardware architecture ofcomplex-valued matrix lower-upper (LU) triangularization formulti-input–multi-output (MIMO) receivers is presented in thispaper. In the LU triangularization process, Gaussian eliminationoperation is expressed as a series of vector-scalar products, wherebasic common computations can be precomputed and sharedto reduce computational complexity. Our computation-sharing-based architecture was implemented using a 0.25-μm CMOSprocess, and the hardware can perform LU triangularization from2 × 2 to 8 × 8 matrices. Numerical results show that the proposedarchitecture has considerable energy savings over conventionalmatrix triangularization schemes.

Index Terms—Low-power very large scale integration (VLSI)design, LU triangularization, matrix decomposition, multi-input–multi-output (MIMO) demodulation.

I. INTRODUCTION

R ECENT advances in mobile communication and com-puting applications demand low-power very large scale

integration (VLSI) implementation of digital signal processing(DSP) systems. As an example, in wireless communicationtechnologies, a MIMO system [1] recently achieves higherspectral efficiency with improved link reliability. However, atthe cost of bandwidth efficiencies, computational complexitysignificantly increases in the receiver side, thus the designof a low-power MIMO receiver being one of the key designconsiderations.

Matrix triangularization and inversion are one of the mostessential components of MIMO receivers [2]. In particular,for MIMO demodulation, complex-valued matrix triangulariza-tion is critical to solve linear equations [3]. Matrix triangu-larization generally needs a large number of complex-valuedmultiplications. In [4], instead of directly calculating multipli-cation, square root, and division operations, read-only memory(ROM)-based lookup tables are used. The major difficultyencountered in this approach is that the ROM size and thequantization error are always tradeoff issues, and considerable

Manuscript received March 11, 2009; revised March 6, 2010; acceptedMay 3, 2010. Date of current version August 13, 2010. This work was supportedin part by the Ubiquitous Computing and Network (UCN) Project, Knowledgeand Economy Frontier R&D Program of the Ministry of Knowledge Economyin Korea under UCN’s Subproject 10C2-C2-30S. This work was recommendedby Associate Editor M. Chakraborty.

J.-W. Choi and J. Lee are with Marvell Semiconductor, Santa Clara, CA95054 USA (e-mail: [email protected], [email protected]).

B. G. Min is with Mewtel Technology, Seoul 143-200, Korea.J. Park is with the School of Electrical Engineering, Korea University, Seoul

136-701, Korea (e-mail: [email protected]).Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TCSII.2010.2055991

ROM area penalty is inevitable to obtain the data with high res-olution. Parallel-processing-based [5], systolic-array-based [6],[7], and circular-linear-array-based [8] architectures for matrixdecomposition are also proposed. However, large-size matricesare the main target, and the architectures are more optimizedfor high performance and high throughput than low power.

In this paper, we propose an energy-efficient architecturefor complex-valued matrix LU triangularization. The triangu-larization can be directly used for solving linear equations inMIMO demodulation and also for matrix inversion. The burdenof huge multiplication operations in matrix triangularizationis mitigated by a computation sharing multiplier (CSHM) [9],where common computations are identified and shared withoutadditional memory. Although the proposed scheme is focusedon LU triangularization, the same approach can be applied toCholesky triangularization when the matrix is a Hermitian pos-itive definite. The main contributions of this work are summa-rized as follows: 1) A novel low-power architecture for matrixdecomposition is proposed. Considering the hardware area andthroughput tradeoff, a CSHM-based sequential architecture ispresented. 2) A new CSHM-based complex-valued multiplieris proposed. For the power estimation of general computation-sharing-based DSP architectures, power consumption of eachCSHM component is also analyzed.

II. SYSTEM MODEL AND LU TRIANGULARIZATION

A. MIMO System

We consider a spatial multiplexing MIMO system with Nt

transmit and Nr receive antennas [1], where Nt = Nr = N .Although Nr can be larger than Nt, we assume Nt = Nr forsimple description. Omitting time indexes without loss of gen-erality, the received signal can be represented in discrete time as

y = Hx + z (1)

where y is the (N × 1) received signal vector, andx = [x1 · · · xN ]T is the (N × 1) transmitted signal vector.Here, xi, i = 1, . . . , N , is the symbol sent from the ith transmitantenna, which is taken from a complex modulation constella-tion. z is the (N × 1) complex additive white Gaussian noisevector with variance σ2

z , and H = [h1,h2, . . . ,hN ] is the (N ×N) channel matrix, where hm = [hm,1 hm,2 · · · hm,N ]Tis the channel gain vector from the mth transmit antenna to thereceiver. Although we assume a flat-fading channel, the resultscan be extended to frequency-selective fading channels with theuse of orthogonal frequency-division multiplexing (OFDM).

At the receiver, the MIMO demodulator has to determinewhich data symbol is sent. Among numerous demodulation

1549-7747/$26.00 © 2010 IEEE

CHOI et al.: ENERGY EFFICIENT HARDWARE ARCHITECTURE OF LU TRIANGULARIZATION FOR MIMO RECEIVER 633

methods [1], a linear equalizer is commonly employed in prac-tice due to its implementation simplicity. The linear equalizermultiplies a matrix G with the received noisy symbol y toseparate the received signal into its transmitted stream as

x̂eq = Gy. (2)

This x̂eq is mapped to the closest transmit signal vector. Thelinear equalizers differ according to G. The zero-forcing (ZF)linear equalizer perfectly compensates channel distortion byinverting the channel, i.e., G = H−1. Since the ZF equalizermay experience noise enhancement when Gz is large, the min-imum mean square error (MMSE) linear equalizer can be usedinstead to minimize the total error composed of the channelcompensation error and the noise enhancement using G =(HHH + σ2

zI)−1HH , where H is the conjugate transpose.

Note that the x̂eq of (2) is usually computed without matrixinversion of H or (HHH + σ2

zI). That is, for the ZF equalizer,x̂eq can be obtained by solving

Hx̂eq = y. (3)

Here, the channel matrix H can be decomposed as H = LU,where L and U are the lower and upper N × N triangularmatrices, respectively. Then, (3) can be solved successively asin the following two steps:

Lp =y, for pUx̂eq =p, for x̂eq. (4)

Note that, in both steps, triangular matrices are involved inthe equations, which can be easily solved using forward andbackward substitution. By performing LU decomposition once,the MIMO demodulation, i.e., (4), can be performed multipletimes, i.e., for y with different time indexes, while the channelis unchanged.

The similar procedure can be employed for the MMSE linearequalizer, except that HHy is used, instead of y, as

LUx̂eq = HHy (5)

where LU = HHH + σ2zI. Since (HHH + σ2

zI) is Hermitianand positive definite, it can be decomposed in a simpler mannerusing Cholesky decomposition (HHH + σ2

zI) = LLH . Here,Cholesky decomposition can also be regarded as a special caseof LU decomposition where U = LH .

B. LU Decomposition Algorithm

As described in the previous section, LU decomposition isa key function of the linear equation required for the MIMOdemodulation. In addition, once the LU decomposition is per-formed, the inversion of any nonsingular matrix A can beeasily obtained since A−1 = (LU)−1 = U−1L−1. Here, theinversion of upper (or lower) triangular matrix U can beperformed by directly solving XU = I, which leads to easybackward substitution. Thus, we focus on implementation ofLU decomposition in the rest of this paper.

LU decomposition of a nonsingular matrix A can be per-formed as in Fig. 1 [10]. Fig. 1(a) presents the pseudocodefor LU decomposition. Fig. 1(b) also illustrates an examplerepresenting how a matrix A is decomposed into lower andupper triangle matrices L and U, respectively. The procedure

Fig. 1. (a) LU decomposition pseudocode. (b) Example on LUdecomposition.

shown in the example is called the Gaussian elimination (GE),and the element in the black circle represents the pivot. Wecan notice in this process that the significant elements of Land U are stored in place in matrix A, which means that thememory area can be saved in the hardware implementation. Forthe recursive step in Fig. 1(a), the process can be divided into1) pivot column element calculations using division operationsand 2) the rest of the matrix update using a series of multiplica-tions and subtractions. In the following sections, we present thatpart of the GE process is expressed as a series of vector-scalarproducts, where we can achieve considerable power savingsbased on computation reuse.

III. CSHM APPROACH

A. CSHM Multiplier

CSHM was proposed [9] to reduce the power consump-tion of a finite-impulse response filter with real-valued co-efficients. As shown in Fig. 2(a), CSHM is composed ofprecomputer and select units and adders (S & A). In CSHM,most of basic computations 1x, 3x, 5x, 7x, 9x, 11x, 13x,and 15x are performed in the precomputer banks, and theoutputs of the precomputer are simply shifted and added togenerate the final output. As an example, when computingC · x, C = 1110_0100_1101_1010, C · x, can be decomposedas 213(111x) + 210(1x) + 24(1101x) + 21(101x). The basicoperations 1x, 5x(101 · x), 7x(111 · x), and 13x(1101 · x) arecomputed in the precomputer bank, and the rest of the shift andadd operations are performed in the S & A. Fig. 2(b) shows the16 × 16 CSHM.

For a single multiplication operation, the CSHM is not powerefficient, compared with the generally used Wallace tree multi-plier (WTM). Fig. 2(c) shows the power comparison of CSHMwith WTM when those multipliers are implemented usingTSMC 0.25-μm technology. Power consumption is measuredat 100 MHz with 2.5 V. 16 × 16 WTM consumes only 81% ofpower, compared with CSHM. Inside CSHM, the precomputer


Fig. 2. (a) CSHM architecture. (b) 16 × 16 CSHM. (c) Power consumptioncomparison between CSHM and WTM.

Fig. 3. Power consumption comparison between (a) the CSHM-based vectorscaler and (b) the WTM-based vector scaler.

bank and S & A have 55% and 45% of power consumptions,respectively.

B. CSHM-Based Vector Scaling

Vector-scaling operation refers to the multiplication of agiven vector C = [c0, c1, . . . , cM−1] with the scalar X . Inthis operation, the input X is multiplied by all the coeffi-cients c0, c1, c2, . . . , cM−1. Fig. 3 shows the power consump-tion comparison between CSHM-based and WTM-based vectorscalers. For the CSHM-based vector scaler, which computes[c0, c1, . . . , c9] × X , the basic computations 1x, 3x, 5x, 7x, 9x,11x, 13x, and 15x are performed only once in precomputer,

Fig. 4. CSHM-based complex multiplier.

TABLE IPOWER CONSUMPTION OF DIFFERENT COMPLEX

MULTIPLIER ARCHITECTURES

and these values are shared by all the S & A’s for generatingci · x, i = 0, 1, 2, 3, . . . , 9. Therefore, the computation powerof vector scaler using CSHM is approximately 115.47 mW, asshown in Fig. 3(a). However, if we use isolated WTM for vectorscaler implementation [Fig. 3(b)], the power consumption is186.3 mW, assuming that the test vectors are the same for thosemultipliers in the vector scaler.

C. Complex-Valued Arithmetic Using CSHM

Complex-valued signal and channel matrices are widely usedin MIMO systems, where the multiplication of two complexnumbers is one of the most common functions. The proposedCSHM approach can be extended to complex-valued multipli-cations and vector-scaling operations.

Generally, the multiplication of two complex numbers (a +bj) and (c + dj) can be expressed as

(a + bj)(c + dj) = (ac − bd) + (ad + bc)j. (6)

To reduce power consumption, (6) can be reformulated usingstrength reduction [11] as follows:

(a − b)d + (c − d)a = ac − bd,

(a − b)d + (c + d)b = ad + bc (7)

where (a − b)d is used for both equations. For hardware im-plementations, direct mapping of multipliers and adders on(6) and (7) is generally used. CSHM can also be used toimplement complex multiplication, which is shown in Fig. 4.When those three different architectures are used to implementcomplex multiplication, Table I shows the power consumptioncomparison. For the direct-mapped architectures of (6) and(7), general WTM and ripple carry adders are used for theimplementations. As a single complex multiplier, the CSHM-based architecture consumes around 6% more power than thestrength-reduction-based complex multiplier.

When CSHM is used to implement complex-valuedvector-scaling operation, considerable power savings can beachieved. As an example, for CSHM-based implementation of[c0, c1, . . . , c9] × X , where c′is, i = 0, 1, 2, . . . , 9, are complex-valued numbers, we need two precomputer banks and 40 S &A’s in the vector scaler. The CSHM-based ten tap complex-valued vector scaler has power savings of around 30% over thestrength-reduction-based (7) complex-valued vector scaler.

CHOI et al.: ENERGY EFFICIENT HARDWARE ARCHITECTURE OF LU TRIANGULARIZATION FOR MIMO RECEIVER 635

Fig. 5. (a) Proposed LU triangularization architecture. (b) Timing diagram of the 2 × 2 LU decomposition process.

IV. CSHM-BASED HARDWARE ARCHITECTURE

FOR LU TRIANGULARIZATION

A. Hardware Architecture

The CSHM-based approach can be effectively used for theimplementation of complex-valued matrix LU triangularizationoperation in Section II-B. For example, a 4 × 4 complex matrixis represented as

A =

⎡⎣ c11 c12 c13 c14

c21 c22 c23 c24c31 c32 c33 c34c41 c42 c43 c44

⎤⎦ (8)

and let us assume that c11 is the pivot. As described in the LUdecomposition procedure shown in Fig. 1, the complex numbersin the same column are first divided by c11

1c11

[c21, c31, c41] =a11 − b11i

a211 + b2

11

[c21, c31, c41] (9)

where c11 = a11 + b11i. Note that (9) represents a complexvector-scaling operation. After computing ck1/c11, k = 2, 3,4, those numbers are again multiplied with pivot row, and itis subtracted from the kth row in GE. Therefore, the operationon the kth row, k = 2, 3, 4, is expressed as[

ck1

c11, ck2, ck3, ck4

]− ck1

c11[0, c21, c31, c41] (10)

which includes vector-scaling operations as well.Fig. 5(a) shows the CSHM-based hardware architecture for

LU triangularization. When high-throughput vector-scaling op-eration is required, a parallel architecture is generally used,as shown in Fig. 3(a), where many S & A’s are operatingsimultaneously. However, when low-throughput operation isallowed, a sequential architecture is used to reduce the hardwarearea. Pivoting, which is usually required for numerical stabilitywhen the diagonals are zero or close to zero, is not included inthis implementation. Fig. 5(b) also presents its timing diagram

for calculating (9) and (10). Since (a11 − b11i) is multiplied/shared with the complex vector [c21, c31, c41] in (9), once1a11, 3a11, . . . , 15a11 and 1b11, 3b11, . . . , 15b11 are computedin the precomputers, the precomputer outputs are not changinguntil the whole vector-scaling operation finishes. Meanwhile,there is no dynamic power consumption in the precomputerbanks, and the precomputer outputs are shared, where compu-tation energy is saved. In order to reduce quantization error,division by (a2

11 + b211) is performed after the vector-scaling

operation. Read/write memory access times for 2 × 2 matrix arealso specified in Fig. 5(b). Since simultaneous memory accessfor read and write is required, two-port (one read and one writeports) memory is used. For divider implementation, a three-stage pipelined divider is used, and the bit widths of the dividerand multiplier are decided as 24 and 16 bits, respectively.

The proposed architecture can perform LU triangularizationfor 2 × 2, 3 × 3, 4 × 4, 6 × 6, and 8 × 8 matrices. Fordifferent sizes of matrices, the control signals are generatedfrom the control logic in Fig. 5(a). As the size of the matrixbecomes larger, the number of clock cycles required to finishthe triangularization operation increases as well.

B. Numerical Results

The proposed architecture is implemented using tsmc0.25-μm technology. Power consumption is measured at100 MHz with 2.5-V supply voltage using more than a hundredtest matrices. Spice-level simulations with nanosim [12] areused. In order to compare various matrix triangularizationarchitectures, direct mapping of (6), strength-reduction-based(7) architectures, and circular linear array-based architecture[8] are implemented. Low-power coordinate rotation digitalcomputer (CORDIC)-based [13] QR decomposition [6] archi-tecture is also implemented and compared. Table II shows thenumerical results for the five different architectures. In thetable, clock cycles refers to the number of clock cycles neededto finish one matrix triangularization, power represents the


TABLE IIPOWER CONSUMPTION AND CLOCK CYCLE COMPARISON FOR DIFFERENT MATRIX TRIANGULARIZATION ARCHITECTURES

average power dissipation during those clock cycles, and powerdelay product means the energy consumed to finish one LUtriangularization operation. For example, the QR decomposi-tion architecture takes 24 cycles to finish 2 × 2 complex-valuedmatrix triangularization, and during those 24 cycles, the averagepower dissipation is 15.50 mW, finally consuming a total energyof 3.72 nJ to perform the triangularization operation.

As expected, as the matrix size increases, the proposedCSHM-based architecture shows more energy savings overthe other approaches due to computation sharing. Energy sav-ings are 29% and 19% over the direct-mapped-based (6) andstrength reduction-based (7) architectures, respectively, in caseof the 8 × 8 matrix. Note that the circular linear array-basedarchitecture [8] is proposed for only real-valued matrix decom-position. The computational complexity of complex-valued ma-trix LU triangularization is approximately four times larger thanthat of the real-valued one. In case of the CORDIC-based QRdecomposition approach, the CORDIC-based approach needsmore cycles to finish the triangularization process. Moreover,as the size of the matrix increases, the energy consumptionis increasing more abruptly due to the larger computationalcomplexity of QR decomposition.

In OFDM systems, where the MIMO technique is widelyapplied in practice, LU decomposition of the channel matrixis required at most every a few microseconds in the worstcase where the channel is assumed to be different in eachsubcarrier and the channel varies with the maximum speed, e.g.,300 km/h [14], [15]. Thus, taking 0.1−1.6 μs (10–160 cyclesat 100 MHz) for LU triangularization is fast enough forreal-time MIMO processing. When higher throughput LU tri-angularization is required, however, parallel CSHM-based ar-chitecture can be used with comparable energy savings.

V. CONCLUSION

Energy-efficient hardware architecture for LU triangulariza-tion has been proposed in this paper. In the process of LUtriangularization, GE with pivoting can be expressed as a seriesof vector-scaling operations, where the CSHM scheme canbe efficiently used for low-power implementation. Numericalresults from hardware implementation have shown significantpower savings. Our proposed approach is also directly ap-plicable to Cholesky triangularization when the matrix is a

Hermitian positive definite. The idea presented in this paper canbe applied to various applications including MIMO receiverswhere the matrix triagularization/inversion operation is criticalfor energy-efficient implementation.

ACKNOWLEDGMENT

The authors would like to thank the IC Design EducationCenter (IDEC) for its software assistance.

REFERENCES

[1] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time WirelessCommunications. Cambridge, U.K.: Cambridge Univ. Press, 2003.

[2] T. Kailath, H. Vikalo, and B. Hassibi, “MIMO receive algorithms,”in Space-Time Wireless Systems: From Array Processing to MIMOCommunications. Cambridge, U.K.: Cambridge Univ. Press, 2005.

[3] N. J. Higham, Accuracy and Stability of Numerical Algorithms.Philadelphia, PA: SIAM, 1996.

[4] C. K. Singh, S. H. Prasad, and P. T. Balsara, “VLSI architecture for matrixinversion using modified Gram-Schmidt-based QR decomposition,” inProc. VLSI, Jan. 2007, pp. 836–841.

[5] D. C. Yu and H. Wang, “A new parallel LU decomposition method,”IEEE Trans. Power Syst., vol. 5, no. 1, pp. 303–310, Feb. 1990.

[6] A. Maltsev, V. Pestretsov, R. Maslennikov, and A. Khoryaev, “Tri-angular systolic array with reduced latency for QR-decompositionof complex matrices,” in Proc. IEEE Int. Symp. Circuits Syst., May 2006,pp. 385–388.

[7] D. Kim and S. V. Rajopadhye, “An improved systolic architecturefor LU decomposition,” in Proc. IEEE Int. Conf. Appl—Specific Syst.,Sep. 2006, pp. 231–238.

[8] G. Govindu, S. Choi, V. Prasanna, V. Daga, S. Gangadharpalli, andV. Sridhar, “A high-performance and energy-efficient architecture forfloating-point based LU decomposition on FPGAs,” in Proc. ParallelDistrib. Process. Symp., Apr. 2004, pp. 149–156.

[9] J. Park, W. Jeong, H. Mahmoodi, Y. Wang, H. Choo, and K. Roy,“Computation sharing programmable FIR filter for high performanceand low power applications,” IEEE J. Solid States Circuits (JSSC), vol. 39,no. 2, pp. 348–357, Feb. 2004.

[10] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction toAlgorithms, 2nd ed. Cambridge, MA: MIT Press, 2001.

[11] K. K. Parhi, VLSI Digital Signal Processing Systems: Design andImplementation. New York: Wiley, 1999.

[12] Nanosim Reference Guide, Synopsys Inc., Mountain View, CA, Jun. 2004.[13] K. Sarigeorgidis and J. Rabaey, “Ultra low power CORDIC processor for

wireless communication algorithms,” J. VLSI Signal Process., vol. 38,no. 2, pp. 115–130, Sep. 2004.

[14] Draft IEEE 802.16m System Requirements, Jan. 2009, IEEE 802.16m-07/002r8.

[15] Evolved Universal Terrestrial Radio Access (E-UTRA); User Equipment(UE) Radio Transmission and Reception (Rel. 8), Dec. 2008, 3GPP TS36.101.

Documents

Energy Efficient Hardware Architecture of LU Triangularization for MIMO Receiver