A Flexible VLSI Architecture for Extracting Diversity and

Spatial Multiplexing Gains in MIMO Channels

Chia-Hsiang Yang

University of California, Los Angeles

Challenges:

1. A unified solution to span the entire diversity-multiplexing tradeoff curve

2. Tradeoff between two search methods

Depth-first: ML performance with variable throughput and long latency

K-best: near ML performance with constant throughput and short latency

3. Antenna array size beyond 4×4

Area increases quadratically with the number of transmit antennas

Critical path increases linearly with the number of transmit antennas

4. Modulations beyond 16-QAM

Hardware increases quickly with the constellation size

Longer latency introduced by the minimum search circuit

5. Multiple sub-carriers

Research Contributions:

1. A unified sphere decoder architecture for extracting diversity and spatial

multiplexing gains in MIMO channels

2. Signal processing techniques to support antenna sizes up to 16×16

Folding: hardware area increases linearly with antenna array size

Loop retiming: reduces the critical path

Data interleaving: supports multiple independent sub-carriers

A region partition enumeration method for constellations up to 64-QAM

3. A flexible architecture

Antenna array: 2×2 to 16×16

Modulations: BPSK to 64-QAM

Number of sub-carriers: 16 to 128

Search method: K-best or depth-first search

4. A simplified multiplier

Numerical strength reduction

Gray coding to reduce number of operations

5. A multi-core architecture for enhanced performance

Abstract—The sphere decoding algorithm is widely used in MIMO communications because of

its ability to approach maximum likelihood detection with significantly reduced

computational complexity. This makes it attractive for hardware implementation; however,

prior work focused only on solutions with a fixed number of antennas or fixed modulations.

This work presents a unified sphere decoder architecture that exploits the diversity-multiplexing

tradeoff in MIMO channels by taking advantage of the flexibility in the number of antennas

and modulation schemes. Several signal processing and circuit techniques are constructively

combined to reduce the hardware complexity: a 20 times area reduction is achieved even

without interleaving of subcarriers compared to direct-mapped architecture. The proposed

flexible architecture supports antenna arrays from 2×2 to 16×16, modulations from BPSK to

64-QAM, over 16 to 128 sub-carriers. The peak estimated data rate exceeds 1.5 Gbps ideal

throughput using a 16 MHz bandwidth in just 0.55 mm² in a standard 90 nm CMOS process.

I. INTRODUCTION

Multi-input multi-output (MIMO) communication has recently received

significant attention due to its potential to increase link robustness and channel

capacity. Hardware realization of MIMO signal processing algorithms is quite

challenging, because it requires multi-dimensional, matrix-based, computations.

However, with the growing demand for higher data transmission rates over wireless

links, the need for devices equipped with multiple antennas increases.

Among various MIMO algorithms, sphere decoding is one of the most promising

solutions. It approximates the information theoretic bound, set by the maximum

likelihood (ML) detection, with several orders of magnitude lower computational

complexity [1] [2]. This means that, for a given hardware cost, the reduced

complexity could be utilized to increase the size of antenna array and effectively

improve the performance beyond the ML performance of a system with smaller array

size. The complexity reduction is achieved by transforming an exhaustive search of

the ML decoders into a tree search procedure of sphere decoding. Tree search is quite

popular in other communications areas such as multi-user detection (MUD) for

CDMA systems, block-based demodulation, and linear block error control code

decoding [3]. Other potential applications include speech recognition, data

compression, protein sequence exploration, and neural signal detection.

The sphere decoding algorithm is a multi-dimensional signal processing task dealing

with vector and matrix arithmetic. The required computation involves hundreds of add

and multiply operations, and may also need divide and trigonometric functions. Such

a high complexity limits the system specifications such as antenna array size and

modulations. In addition, prior work focused only on solutions with a fixed number of

antennas or fixed modulations [16][17][19][21][22][24]. In this work, we evaluate the

architectures proposed in prior work and advance the state of the art in the area of

multidimensional matrix-based signal processing hardware. A number of signal

processing techniques [23] are considered jointly with the technology parameters to

greatly reduce hardware area (cost) and power while maximizing the performance.

This work develops an architecture that further simplifies sphere decoding

implementation by jointly considering tradeoffs at the algorithm, architecture, and

circuit layers of abstraction, with the goal of minimizing chip power and area. At the

same time, additional degrees of freedom are considered in the design in order to take

full advantage of the diversity and spatial multiplexing gains available in MIMO

wireless channels [5]. Tuning over a range of diversity-multiplexing points is possible

by varying antenna array size and modulation scheme, for example. Flexibility and

scalability are, thus, key additional requirements in the design of multi-mode,

multi-standard systems. Also, our work uses the Matlab/Simulink framework to

improve design productivity in mapping of DSP algorithms onto silicon. The BEE2

platform [38] is used to verify system functionality before entering physical ASIC

design.

This proposal is organized as follows. Section II reviews the fundamental

diversity-multiplexing tradeoff in MIMO communications and describes sphere

decoding algorithm. Several signal processing techniques, evaluated in

power-area-performance space, and architecture details are presented in Section III.

Section IV describes the Simulink design environment and BEE2 emulation platform.

Conclusions are summed up in Section V. Finally, Section VI proposes future work

and the timeline.

II. ALGORITHM SPACE EXPLORATION

A MIMO system can improve the reliability of a wireless link through increased

diversity or improve the channel capacity through spatial multiplexing. Diversity gain

and spatial multiplexing gain are related to system coverage range and data rate,

respectively. Both gains can be improved using a larger antenna array. However,

given a MIMO system, there is a fundamental trade-off between these two gains [4]

[5]. In the diversity-multiplexing space, repetition code, Alamouti code, and

space-time code use data redundancy to increase diversity at the price of losing spatial

multiplexing gain. In contrast, Bell Labs Layered Space Time (BLAST) algorithm,

Singular Value Decomposition (SVD), and QR decomposition allocate data-streams

in different eigen-modes to maximize spatial multiplexing gain while sacrificing

diversity gain, as shown in Fig. 1.

Sphere decoding is a decoding scheme that can extract both diversity and

multiplexing gains. With flexibility in coding and modulation, sphere decoder can

effectively explore the entire tradeoff curve as shown in Fig. 1. The original data type

for sphere decoding is uncoded data. By manipulation of input data, sphere decoding

is capable of decoding space-time block codes (STBC), which improves the error

probability and increases diversity gain. The data rate can be maximized by

transmitting different modulations over different MIMO substreams to increase

spatial multiplexing gain. Also, with proper preprocessing, the decoding process starts

from decoding the symbols with highest SNR first, and then canceling the effect of

the decoded symbols for remaining symbols until the final symbol is decoded. This

decoding sequence is equivalent to that in BLAST [41]. A unified sphere decoder

model is illustrated in the following section.

[Figure: x-axis: spatial multiplexing gain (rate); y-axis: diversity gain (range); labeled schemes: repetition, Alamouti, space-time, BLAST, SVD, QR, and sphere decoding; arrows labeled "array size"]

Fig. 1. Diversity-Multiplexing tradeoff in MIMO communications.

A. Sphere Decoding Algorithm

Consider a multiple antenna system with M transmitter antennas and N receive

antennas. The received vector y can be represented by

$y = Hs + n$ (1)

where y is an N×1 vector of received symbols, and H denotes an N×M channel matrix

whose elements are i.i.d. complex Gaussian with zero mean and unit variance. Vectors

s and n (M×1 and N×1 respectively) represent the transmitted symbols and zero mean,

circularly symmetric white Gaussian noise, respectively. The transmitted vector

$s \in Q^M$ with the smallest Euclidean distance is selected as ML estimate in (2). The

channel matrix can be decomposed further using QR factorization; the equivalent ML

estimate thus can be written as

$\hat{s} = \arg\min \|y - Hs\|^2 = \arg\min \|\hat{y} - Rs\|^2$ (2)

with $\hat{y} = Q^H y = R s_{ZF}$,

where Q is a unitary matrix, R is an upper triangular matrix, and $s_{ZF} = (H^H H)^{-1} H^H y$

is the zero-forcing (unconstrained ML) estimate. The signal model is presented in Fig.

2.

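[Figure: block diagram. TX sends s through the channel H with additive noise n; RX receives y and applies $Q^H$ (from H = QR) to obtain $\hat{y}$; the sphere decoder outputs $\hat{s} = \arg\min \|\hat{y} - Rs\|^2$]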

Fig. 2. Signal model of sphere decoding algorithm.

The most commonly used methods for QR decomposition are Gram-Schmidt

decomposition, Householder transformation, and Givens rotations [7]. Several

modifications, such as division-free or square-root- and division-reduced methods, have been

proposed to simplify the operations in the original algorithms [45] [46]. For hardware

realization, [8] proposed an algorithm suitable for fixed-point implementation and [9]

proposed a CORDIC-based triangular systolic array architecture to reduce latency.

Under the assumption of block fading channel, QR decomposition is computed at the

packet rate.
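As an illustration of the preprocessing step, the following is a minimal floating-point sketch of QR decomposition by modified Gram-Schmidt, written in plain Python. It is not the fixed-point algorithm of [8] or the CORDIC array of [9]; the function name `qr_gram_schmidt` and the list-of-rows matrix layout are assumptions made for this example only.

```python
def qr_gram_schmidt(H):
    """Modified Gram-Schmidt QR of a complex matrix H (list of rows).

    Returns (Q, R) with H = Q R, where Q has orthonormal columns and
    R is upper triangular with a real, positive diagonal.
    Assumes H has full column rank (hypothetical helper for illustration).
    """
    n_rows, n_cols = len(H), len(H[0])
    # Work column by column.
    V = [[complex(H[r][c]) for r in range(n_rows)] for c in range(n_cols)]
    q_cols = []
    R = [[0j] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        norm = sum(abs(x) ** 2 for x in V[i]) ** 0.5
        R[i][i] = complex(norm)
        q = [x / norm for x in V[i]]
        q_cols.append(q)
        for j in range(i + 1, n_cols):
            # Projection of the remaining column j onto q.
            R[i][j] = sum(qk.conjugate() * vk for qk, vk in zip(q, V[j]))
            V[j] = [vk - R[i][j] * qk for qk, vk in zip(q, V[j])]
    Q = [[q_cols[c][r] for c in range(n_cols)] for r in range(n_rows)]
    return Q, R
```

Under the block-fading assumption, a routine of this kind would run once per packet, and only R and $Q^H y$ are passed on to the tree search.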

Using the upper triangular nature of R, the symbol decoding begins from the last

row and occurs in several steps. The decoded symbols are used for successive

decoding steps until all symbols are decoded. This decoding algorithm can be mapped

to finding a shortest path (with minimum Euclidean distance) in a tree topology – one

possible constellation point denotes one node, each row of the R matrix is mapped to

each level of the tree whose edges are weighted by channel coefficients. The whole

solution space of this tree is equivalent to exhaustive search in the trellis diagram of

the original problem; the total number of combinations of transmitted symbols is $|Q|^M$,

where |Q| is the constellation size.

By properly choosing a search radius and a search method, the ML solution can be

approached by visiting only nodes within a hyper-sphere, rather than performing an

exhaustive search. This complexity reduction is feasible, because the Euclidean

distance is a cumulative sum of square terms. This means that for each node, if its

Euclidean distance is larger than the search radius, the corresponding branches are

outside the search radius as well. The conceptual view of sphere decoding algorithm

is illustrated in Fig. 3. Tree pruning technique makes sphere decoding achieve ML

performance with polynomial complexity (highlighted nodes in Fig. 3) rather than

exponential complexity (all nodes in Fig. 3) [1].

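[Figure: search tree with one level per transmit antenna (ant-1, ant-2, ..., ant-M), a number of branches per node equal to the constellation size, and a search radius bounding the visited nodes]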

Fig. 3. Concept of sphere decoding. Unlikely nodes and branches are indicated with gray shade.
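The pruning argument can be sketched as a short depth-first search in Python. This is a behavioral model under simplifying assumptions (floating point, candidates visited in list order, no detection ordering), not the hardware architecture of this work; the name `sphere_decode` and its return values are chosen for the example.

```python
def sphere_decode(R, y_hat, constellation, radius_sq=float("inf")):
    """Depth-first search for argmin ||y_hat - R s||^2 with tree pruning.

    R is upper triangular (list of rows), y_hat = Q^H y.
    Returns (best_s, best_dist_sq, visited_nodes).
    """
    M = len(R)
    best = {"s": None, "d": radius_sq, "visited": 0}
    s = [None] * M

    def search(level, ped):
        # Levels run from the last row of R (M-1) down to row 0.
        for cand in constellation:
            s[level] = cand
            resid = y_hat[level] - sum(R[level][j] * s[j] for j in range(level, M))
            new_ped = ped + abs(resid) ** 2  # partial Euclidean distance
            best["visited"] += 1
            if new_ped >= best["d"]:
                continue  # the distance only grows deeper in the tree: prune
            if level == 0:
                # Leaf inside the sphere: keep it and shrink the radius.
                best["s"], best["d"] = list(s), new_ped
            else:
                search(level - 1, new_ped)

    search(M - 1, 0.0)
    return best["s"], best["d"], best["visited"]
```

Because every term added to the partial distance is non-negative, discarding a node also discards its whole subtree, so the visited-node count stays below the full tree size whenever a leaf is found early.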

B. Performance Improvements

Several simple yet effective methods such as detection ordering, candidate

enumeration and search radius setting are applied to improve error performance

and/or reduce the complexity of basic sphere decoding [3]. For instance, the sphere

decoding algorithm for a 4×4 64-QAM system, compared to exhaustive search, results

in over $10^5$ times reduction in computational complexity [10].

1) Detection Ordering: The idea behind detection ordering is to detect symbols with

the largest SNR first: to avoid discarding the ML solution, the first decoded symbols

should be the most reliable. Various ordering algorithms have been proposed for the

preprocessing stage: V-BLAST-ZF ordering, V-BLAST-MMSE, and Norm ordering [3]

[25]. Assuming a packet-based wireless communication system, the ordering only

needs to be performed once at the beginning of each received frame.

2) Candidate Enumeration: Detection ordering is applied across levels in the tree

topology. For each level, the order of constellation point enumeration is another

important factor to improve search speed. Schnorr-Euchner (SE) enumeration

suggests traversing the constellation candidates according to the cumulative distance

increment in an ascending order [2]. Therefore, the first candidate $s_i$ for each row is

the one with minimum distance between $b_i$ and $R_{ii}Q_i$ in (3). Finding a good admissible

solution early means that we can shrink our initial radius early.

$\hat{y}_i - \sum_{j=i}^{M} R_{ij}s_j = b_i - R_{ii}s_i$ with $b_i = \hat{y}_i - \sum_{j=i+1}^{M} R_{ij}s_j$. (3)
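A brute-force reference for this ordering can be written in a few lines of Python. The sketch below computes $b_i$ from the already decided symbols and sorts the constellation by distance; the helper names are hypothetical, and sorting all candidates is used here only as a behavioral reference.

```python
def interference_cancelled(y_hat_i, row_i, s, i):
    """b_i = y_hat_i - sum_{j>i} R_ij s_j, using 0-based index i."""
    return y_hat_i - sum(row_i[j] * s[j] for j in range(i + 1, len(s)))


def se_order(b_i, r_ii, constellation):
    """Candidates sorted by increasing |b_i - r_ii * q|^2 (SE order)."""
    return sorted(constellation, key=lambda q: abs(b_i - r_ii * q) ** 2)
```

For example, with a 4-PAM alphabet `[-3, -1, 1, 3]`, `r_ii = 2` and `b_i = 1.5`, the distances are 7.5, 3.5, 0.5 and 4.5, so the enumeration order is 1, -1, 3, -3.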

3) Search Radius Setting: One major feature of sphere decoding is the radius

shrinking. Once a solution is found with a smaller Euclidean distance, the search

radius is updated to this value so that more unlikely branches can be pruned. However,

the initial choice of search radius is not easy for sphere decoding, because the choice

of search radius influences the complexity of the algorithm. When search radius is too

large, a very high number of visited nodes is in the solution space which causes high

detection complexity. Conversely, when the search radius is too small, this may result

in an empty sphere and no available solution.

Under the AWGN model, the sum of squared noise terms is central chi-square distributed

with 2M degrees of freedom [11] [47]. Given the channel SNR, the search radius can

be chosen from this distribution so that the solution lies inside the sphere with a given confidence level.

If the channel SNR is unknown, the Euclidean distance of the zero-forcing solution can be

used as an initial guess. An algorithm with an increasing search radius was also proposed; it

starts the search with a strict search radius first, and expands the search radius if no

solution is available within the radius [12] [48].
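One way to turn the chi-square model into a number is sketched below in Python. It assumes complex noise with variance `noise_var` per component (so `noise_var / 2` per real dimension) and uses the closed-form CDF that exists for an even number of degrees of freedom; the function names and the bisection search are choices made for this example, not taken from [11] or [47].

```python
import math


def chi2_cdf_even(x, dof):
    """CDF of a central chi-square variable with an even dof = 2k:
    1 - exp(-x/2) * sum_{n=0}^{k-1} (x/2)^n / n!
    """
    k = dof // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for n in range(1, k):
        term *= half / n
        total += term
    return 1.0 - math.exp(-half) * total


def initial_radius_sq(num_tx, noise_var, confidence=0.99):
    """Squared radius that contains the noise norm with the given probability."""
    dof = 2 * num_tx
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < confidence:
        hi *= 2.0
    for _ in range(100):  # bisection on the monotonic CDF
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, dof) < confidence:
            lo = mid
        else:
            hi = mid
    # Scale from unit-variance real dimensions to noise_var / 2 each.
    return hi * noise_var / 2.0
```

If the search finds no leaf inside this radius, the increasing-radius strategy described above would rerun it with a larger confidence value.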

C. Tradeoff in Diversity-Multiplexing Space

A unified sphere decoder architecture is illustrated here for extracting diversity

and spatial multiplexing gains along the tradeoff curve. We demonstrate that adding

flexibility in antenna array size and modulation is the key feature for this

purpose. Antenna array size provides an added flexibility to shift the tradeoff curve in

the diversity-multiplexing space.

In order to maximize diversity gain, we have to supply to the receiver multiple

independently faded replicas of the same symbol, so that the error probability is

reduced [13] [14]. The data replicas can be sent in space and/or time directions. Since

a unified signal model can be developed for these space-time (ST) coding schemes,

the same sphere decoder architecture can be used with some data rearrangement.

Sphere decoding supporting algebraic ST codes [48], linear dispersion code [49], and

space time block code (STBC) [15] were reported in prior work. The ML estimate can

be written as

$\hat{s} = \arg\min \|y - Bs\|^2$ (4)

where matrix B depends on code generators and channel matrix. By interpreting B as

H in the original signal model, sphere decoding algorithm can be applied. Since the

matrix dimension we deal with is changed due to the data rearrangement in the

preprocessing stage, the equivalent antenna array size will be changed accordingly.

For example, repetition coding by 2 in space domain for an 8×8 system will be

transformed into data processing in a 4×4 system (only one half of symbols need to be

decoded). This requirement enhances the need for flexibility in antenna array size.

Spatial multiplexing gain is characterized by data rate. To maximize spatial

multiplexing gain, we should allow data rate to scale with the SNR or assign different

data rate to different substreams for a fixed SNR [5][15]. To this end, modulation

scheme should be adaptive according to channel condition: a larger constellation is

applied to substreams with higher SNR, and a smaller constellation is applied to

substreams with lower SNR. In principle, this transmission strategy just uses

water-filling in space domain. The system performance perspective, therefore, further

motivates the need for adding flexibility in modulation schemes.

III. ARCHITECTURE SPACE EXPLORATION

The optimal architecture is decided by jointly considering tradeoffs at the

algorithm, architecture, and circuit layers of abstraction, with the goal of minimizing

chip power and area. As shown in Fig. 4, a layered design approach is adopted to

merge algorithm and circuit decisions. An efficient multiplier is proposed to reduce

area and delay at the same time. Saving in area directly translates to power reduction

since power spent in charging/discharging parasitic capacitances is also reduced. At

the processing element (PE) architecture level, we evaluate the existing architectures

[16][17][19][21][22][24] and propose a solution with improved area and throughput.

Unlike prior work, flexibility is also considered in the design stage. Antenna size,

modulation scheme, number of subcarriers, and search method are designed with

flexibility and scalability to cover multiple communication scenarios. A multi-core

architecture which consists of many PEs ("small cores") is developed to support the

tradeoff between range and data rate at the system architecture level. We finally

summarize the flexibility, scalability, and system specification.

[Figure: layered design approach with layers labeled multiplier, metric calculation, PE architecture, and system architecture]

Fig. 4. Illustration of layered design approach.

A. Numerical Strength Reduction

From an algorithm perspective, the complexity of sphere decoding is evaluated by

the number of nodes visited in the tree search process. When considered for hardware

implementation, decoding algorithms are generally compared in terms of the number

of multiplications. Down to the circuit level, the size of multipliers is the key factor to

estimating the area, speed, and power of the sphere decoder.

We start with simplifying the cost of the multiply operation to reduce hardware

complexity. The multiplication is required to calculate Euclidean distance, which is

mathematically represented by two equivalent forms, Eqs. (5), (6).

$\hat{s}_{ML} = \arg\min \|R(s - s_{ZF})\|^2$ (5)

$= \arg\min \|Q^H y - Rs\|^2$ (6)

Seemingly, the number of multiplications in Eq. (5) is less than in Eq. (6): one

multiplication for Eq. (5) and two multiplications for Eq. (6). Hence, Eq. (5) was most

commonly used in prior work [16]-[21] as a baseline for implementation. However, a

careful investigation shows that Eq. (6) is a better choice from hardware perspective

for at least two reasons. First, we observe that $s_{ZF}$ and $Q^H y$ can be pre-computed and,

hence, have negligible impact on the total number of operations. Also, computation

effort of $s_{ZF}$ is not less than $Q^H y$. Second, the wordlength of s is usually much shorter

than $s_{ZF}$. Separating terms as in Eq. (6) results in multipliers with reduced wordlength.

Without loss of generality, the normalized size of a multiplier can be estimated by

the product of wordlength of the multiplier and multiplicand. The normalized delay of

a multiplier can be estimated by the sum of wordlength of the multiplier and

multiplicand if an array multiplier is used [39]. The array multiplier approximation

works well for first-order comparison purposes. Table 1 summarizes the relative area

and delay reduction of a multiplier due to numerical strength reduction in a 64-QAM

system, where wordlength (WL) of s is 3 for a real multiplier. We see that the area

reduction is at least 50%, and that the delay reduction also reaches 50% for large

wordlength inputs. The absolute area difference between these two types of

multipliers is amplified by the total number of multiplications in the entire decoding

process, which is approximately $O(M^3)$.

TABLE I

AREA AND DELAY REDUCTION DUE TO NUMERICAL STRENGTH REDUCTION

WL of sZF                   6          8          10         12

WL of R = 12, area/delay    0.5/0.83   0.38/0.75  0.3/0.68   0.25/0.63

WL of R = 16, area/delay    0.5/0.68   0.38/0.63  0.3/0.58   0.25/0.54
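The first-order model stated above (area proportional to the product of the operand wordlengths, array-multiplier delay proportional to their sum) can be evaluated with a small Python helper. The function name and argument names are assumptions for this sketch; it is checked here only against the WL of R = 12 row.

```python
def multiplier_ratios(wl_r, wl_s, wl_szf):
    """Relative area and delay of an R x s multiplier versus an R x s_ZF one.

    Area is modeled as the product of the two operand wordlengths and
    delay as their sum (array multiplier, first-order estimate only).
    """
    area = (wl_r * wl_s) / (wl_r * wl_szf)
    delay = (wl_r + wl_s) / (wl_r + wl_szf)
    return area, delay


# 64-QAM with WL of s = 3 and WL of R = 12, sweeping the WL of s_ZF.
for wl_szf in (6, 8, 10, 12):
    area, delay = multiplier_ratios(12, 3, wl_szf)
    print(wl_szf, round(area, 3), round(delay, 3))
```

In this model the area ratio depends only on the two symbol wordlengths, since the wordlength of R cancels out.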

The multiplier can be simplified further by taking advantage of some

characteristics of communication signal processing: Gray coding and quantization

effects. Gray code is a more compact representation in the constellation plane since

only odd numbers are used. Conventionally, the number is transformed to 2’s

complement representation for the purpose of arithmetic operations. Carefully

examining the Gray code representation, the corresponding multiplication can be

implemented by simple shift, add and invert operations. The code mapping, the

associated operations, and the simplified multiplier are shown in Fig. 5. The neg

operator stands for bit-inversion. 1-bit carry-in in 2’s complement can be absorbed as

a carry-in (shaded in gray) in the following adders or simply be discarded as a

quantization error on LSB, which can be recovered by wordlength optimization.

The shifter has no direct area cost apart from routing, while the cost of inverters

and multiplexers is relatively low because they are simple operations. Overall, it is

possible to implement one complex multiplier with 6 adders + inverters and

multiplexers, resulting in a total 40% area reduction compared to traditional approach

(area is estimated by Synopsys Design Compiler). This implementation does not

imply that we have to force the use of Gray coding in the constellation plane; the Gray

coding is only used inside the sphere decoder to simplify metric calculation and

candidate enumeration. The decoded symbols can be converted into any arithmetic

representation at the sphere decoder outputs.

Gray code    000   001   011   010   110   111   101   100

value         -7    -5    -3    -1     1     3     5     7

operation     -7    -5    -3    -1     1   4-1   4+1   8-1

[Figure: simplified multiplier built from shifts (<<1, <<2), bit-inversion (neg), and multiplexers controlled by the code bits S2, S1, and S0]

Fig. 5. Gray code representation and the simplified multiplier.
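The shift-and-add idea can be modeled behaviorally in Python. The sketch below follows the code mapping listed above and the decompositions 3 = 4-1, 5 = 4+1, 7 = 8-1; it does not reproduce the multiplexer structure of the figure, integer negation stands in for the bit-inversion plus carry-in, and the name `gray_multiply` is hypothetical.

```python
# Gray code to constellation value, as listed in the mapping above.
GRAY_TO_VALUE = {0b000: -7, 0b001: -5, 0b011: -3, 0b010: -1,
                 0b110: 1, 0b111: 3, 0b101: 5, 0b100: 7}


def gray_multiply(r, code):
    """Multiply integer r by the Gray-coded symbol using shifts and adds."""
    s2 = (code >> 2) & 1
    s1 = (code >> 1) & 1
    s0 = code & 1
    # The magnitude depends only on (s1, s0): 10 -> 1, 11 -> 3, 01 -> 5, 00 -> 7.
    if s1 and not s0:
        mag = r                  # 1
    elif s1 and s0:
        mag = (r << 2) - r       # 4 - 1
    elif s0:
        mag = (r << 2) + r       # 4 + 1
    else:
        mag = (r << 3) - r       # 8 - 1
    # s2 = 1 encodes the positive half of the axis in this mapping.
    return mag if s2 else -mag
```

Each product needs at most one shift, one addition or subtraction, and an optional negation, in line with the shift, add, and invert operations described above.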

B. Architecture Tradeoff

In the prior work, two major types of tree search methods are reported: depth-first

(DF) [23] [24] and K-best [16]-[22]. The depth-first algorithm starts the search from

the root of the tree and explores as far as possible along each branch, then it

back-traces until a leaf node is found. The K-best algorithm approximates a

breadth-first search by keeping only K branches with the smallest partial Euclidean

distance (PED) at each level [26], which is similar to the M-algorithm in trellis

decoding [27]. The major advantages of DF are that the ML performance can be

achieved, and that radius shrinking can be used for tree pruning. On the other hand,

the advantages of K-best are its uniform data path and constant throughput.
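For comparison with the depth-first search, a behavioral K-best search can be sketched in Python as follows. It assumes an upper triangular R and $\hat{y} = Q^H y$ as in (2); the name `k_best_decode` is an assumption for this example, and the full sort used here is a software convenience rather than a description of any hardware in the cited work.

```python
def k_best_decode(R, y_hat, constellation, K):
    """Breadth-first search that keeps the K smallest-PED paths per level.

    Returns (best_s, best_ped); the result is near-ML, not guaranteed ML.
    """
    M = len(R)
    # Each survivor is (PED, symbols for indices level+1 .. M-1).
    survivors = [(0.0, ())]
    for level in range(M - 1, -1, -1):
        children = []
        for ped, tail in survivors:
            for cand in constellation:
                partial = (cand,) + tail  # symbols for indices level .. M-1
                resid = y_hat[level] - sum(
                    R[level][level + k] * partial[k] for k in range(M - level))
                children.append((ped + abs(resid) ** 2, partial))
        children.sort(key=lambda c: c[0])
        survivors = children[:K]  # fixed work per level: constant throughput
    best_ped, best_s = survivors[0]
    return list(best_s), best_ped
```

Every level expands at most K·|Q| children regardless of the channel, which is the source of the uniform data path; the price is that the ML path can be dropped if it is not among the K best at some level.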

Further examining details, depth-first ensures the ML performance if complete

solution space is explored. This might not be feasible in practice, however, because of

limited buffer size and processing cycles. This means that some termination schemes

should be used and thus ML performance is no longer guaranteed. Since the default

input is uncoded data, achieving a sub-optimal performance while keeping constant

throughput is more important. Then, space-time codes or error correction codes can be

used to improve the performance. The iterative decoding scheme which combines

MIMO decoder and error correction code decoder was proven to achieve

near-capacity performance [2].

In hardware implementation, depth-first is realized in a folding-like architecture

because only one node is visited at a time during the tree search process. In this case,

an extra memory to record the visited nodes is required, for the trace-back operation.

K-best is realized in a multi-stage pipelined way, because no trace-back is needed. To

process K data paths at the same time, parallel architecture is applied. Figure 6

illustrates the basic architectures of these two search schemes, and Table 2

summarizes their comparison in terms of circuit metrics and algorithmic performance.

[Figure: (a) depth-first (folding): a single PE reused across levels; (b) K-best (parallel and multi-stage): PE 1, PE 2, ..., PE M]

Fig. 6. Basic architecture of (a) depth-first and (b) K-Best algorithm.

TABLE II

COMPARISON OF DEPTH-FIRST AND K-BEST ALGORITHM

              Area    Throughput   Latency   Radius shrinking / tree pruning   Performance

Depth-first   Small   Variable     Long      Yes                               ML

K-best        Large   Constant     Short     No                                Near-ML

For the sphere decoder operating with a large antenna array, the biggest challenge

in the implementation is reducing area of the design. Using the number of (complex)

multipliers as a first order area estimate, the number of multipliers needed in the

folding and multi-stage architectures are M and M(M+1)/2, respectively, where M is

the number of transmit antennas. Expanding a 4×4 system to a 16×16 system, relative

area increases from 4 to 16 for the folding architecture and 10 to 136 for the

multi-stage architecture. The folding architecture is 2.5× to 8.5× more area efficient

compared to the multi-stage architecture, as shown in Fig. 7 (a). To keep the area

within a reasonable value, folding technique is considered. The second design

challenge is operating frequency for the folded architecture.
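The multiplier counts quoted above follow directly from the two formulas, as this short Python check shows (the function name is chosen for the example).

```python
def multiplier_counts(M):
    """First-order complex-multiplier counts for M transmit antennas."""
    folding = M
    multi_stage = M * (M + 1) // 2
    return folding, multi_stage


for M in (4, 16):
    folding, multi_stage = multiplier_counts(M)
    print(M, folding, multi_stage, multi_stage / folding)
```

For M = 4 this gives 4 and 10 multipliers (ratio 2.5), and for M = 16 it gives 16 and 136 (ratio 8.5).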

As the array size increases, the number of operands in the Multiply-Accumulate

(MAC) operation in the metric calculation unit increases proportionally to the number

of antennas. Assuming a tree adder design, the critical path delay roughly increases

linearly with the number of transmit antennas. However, the time required to finish

the MAC operation should be scaled down by the number of antennas in order to

increase the throughput proportionally to the number of antennas. This timing

requirement for a fixed bandwidth is shown in Fig. 7 (b). The situation is actually

worse when metric enumeration is included in the loop. Since pipelining in the loop is

considered a difficult task, this architecture cannot operate at a high frequency even

for a 4×4 system [24].

To facilitate pipeline insertion, inputs are up-sampled by a factor m, and then one

register can be replaced with m pipeline registers in the loop using Noble Identity [42].

In this case, only one out of m samples is useful data, and the rest could be repeating

values of an original sample or padding zeros. By applying data-stream interleaving,

samples of other independent data streams can be introduced in the loop in place of

the repeated values or padding zeros. Technique of interleaving is therefore used to

improve area efficiency through logic sharing and to provide flexibility needed to

support varying number of data sub-carriers. In a multi-carrier communication system,

data streams are transmitted over narrow-band sub-carriers [28].
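The effect of interleaving on a pipelined loop can be demonstrated with a toy Python model. A first-order recursion stands in for the tree-search loop (an assumption made purely to keep the example small); with m registers in the feedback path, m interleaved streams each see their own single-register recursion.

```python
def pipelined_loop(samples, a, m):
    """Feedback loop y[t] = x[t] + a * y[t - m], i.e. m registers in the loop."""
    regs = [0.0] * m  # regs[-1] holds y[t - m]
    out = []
    for x in samples:
        y = x + a * regs[-1]
        regs = [y] + regs[:-1]  # advance the pipeline by one clock
        out.append(y)
    return out


def interleave(streams):
    """Merge equal-length streams sample by sample."""
    return [x for group in zip(*streams) for x in group]


def deinterleave(samples, m):
    """Split an interleaved sequence back into m streams."""
    return [samples[k::m] for k in range(m)]


streams = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [0.5, 0.25, 0.125]]
shared = deinterleave(pipelined_loop(interleave(streams), 0.5, 3), 3)
separate = [pipelined_loop(s, 0.5, 1) for s in streams]
print(shared == separate)
```

The shared loop produces the same per-stream outputs as three separate unpipelined loops, which is why the extra pipeline registers cost no throughput when enough independent sub-carriers are available.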

[Figure: (a) area versus antenna array size (4×4, 8×8, 16×16) for the multi-stage and folding architectures, showing the area reduction using the folding technique; (b) delay versus antenna array size, showing the growing timing gap between the timing requirement and the critical path in the loop of the folding architecture]

Fig. 7. Design challenge and tradeoff for large antenna size. Impact of antenna array size on (a) area

and (b) critical path delay.

C. PE structure

The function of the PE is to find the $s_i$ with minimum $T_i$ ($T_i = |b_i - R_{ii}s_i|^2$) for each

level in the tree search, and to provide a candidate list with $T_i$ in a descending order

since a path with smaller $T_i$ means a higher probability to be the ML estimate. A

straightforward algorithm mapping is to enumerate all possible constellations and sort

the $T_i$ to find the $s_i$ and the candidate list [24]. The hardware cost and computational

latency of this architecture is very high for a large constellation size due to the circuit

parallelism and inevitable latency of the sorting circuit. To resolve this problem, we

propose another strategy: first, the closest point is found through the geometric

relationship since the $s_i$ with minimum $T_i$ stands for the closest point between $b_i$ and

$R_{ii}Q_i$. The second step is to use the selected $s_i$ to calculate $T_i$. Finally, the candidate list

is generated by the constellation arrangement, as described in Section III-C-2, Fig. 12.

We decompose the PE into two parts: Metric Calculation Unit (MCU) and Metric

Enumeration Unit (MEU). Each submodule can be mapped to Area-Energy-Delay

space to explore optimal design parameters for the top-level integration.

Decomposing a design problem along these three axes provides insight into design

techniques and their impact on power, area, and throughput. Concurrency versus

latency is one of the basic tradeoffs that need to be considered. Maximizing data

throughput calls for a parallel architecture, which results in a large area. Conversely,

time-multiplexing improves area efficiency, but increases latency. For example, the

decoding algorithm operating on complex numbers can be transformed into a

real-valued problem, which results in a tree that is twice as deep as the original tree

with a smaller number of children per node [16]. Since the multipliers are reused, the

number of multipliers is reduced to one half at the cost of equal throughput reduction.
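The complex-to-real transformation mentioned above can be written out explicitly. The Python sketch below builds the standard real-valued equivalent of y = Hs, stacking real parts over imaginary parts; the function name `to_real_model` and the list-based layout are assumptions for this example.

```python
def to_real_model(H, y):
    """Real-valued equivalent of the complex model y = H s.

    [Re y]   [Re H  -Im H] [Re s]
    [Im y] = [Im H   Re H] [Im s]

    An N x M complex system becomes a 2N x 2M real system.
    """
    n_rows, n_cols = len(H), len(H[0])
    top = [[H[r][c].real for c in range(n_cols)] +
           [-H[r][c].imag for c in range(n_cols)] for r in range(n_rows)]
    bottom = [[H[r][c].imag for c in range(n_cols)] +
              [H[r][c].real for c in range(n_cols)] for r in range(n_rows)]
    y_real = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_real
```

For a square QAM constellation, the resulting tree has 2M levels, and each node branches over the per-axis amplitudes only (8 children instead of 64 for 64-QAM), which is the trade of depth for fan-out described above.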

Flexibility is another issue in circuit design. Ideally, the circuit should be flexible

to support different search schemes (Depth-first or K-best). In general, the overhead

of flexibility results in reduction of both energy efficiency and area efficiency. This

overhead should be minimized while maintaining system performance. Fig. 8 shows

the circuit diagram of one PE. There are m-stage pipeline registers inserted in the loop,

so the critical path can be shortened under the timing constraint by choosing a larger

m. Since m data streams are interleaved into the PE, the hardware always keeps active,

creating the maximum throughput as if the m pipeline registers are introduced without

the loop. The area overhead of the up-samplers for R can be removed if R is invariant

for each sub-carrier during one packet transmission. The flexibility of search scheme

is provided by the shift-register chain, which can be configured as forward trace or

backward trace. By placing K PEs onto one sphere decoder, K search paths are

explored at the same time to implement K-best algorithm, while each PE has

flexibility to trace back as Depth-first. The flexibility to support varying antenna size

is provided by the folding architecture. It reuses the same hardware with a higher

clock frequency as the antenna size increases. The details of sub-modules are

illustrated in the following.

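[Figure: block diagram of one PE, comprising the MCU (shift-register chain holding s, partial products with R, adder tree, m pipeline stages, and a subtraction from $\hat{y}_i$ to form $b_i$) and the MEU (symbol selection using $R_{ii}$, a subtraction, and a $|\cdot|^2$ block producing $T_i$)]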

Fig. 8. Circuit diagram of one PE.

1) Metric Calculation:

Metric Calculation Unit (MCU) computes $\sum_{j=i+1}^{M} R_{ij}s_j$. Basically, it executes a

Multiply-Accumulate (MAC) operation. To accumulate the maximal 16 operands and

achieve the highest throughput, there are 15 simplified complex + 1 simplified real

multipliers followed by an adder tree that merges the partial products. It is possible to

reduce the number of multipliers in a time-multiplexing manner at the price of lower

throughput [30]. For example, 4 complex multipliers can be time-multiplexed by 4 to

deploy 16 multipliers, with throughput reduced by 4. Such tuning at the architecture

level is used to position the design along throughput and power axis, with optimal

tuning of variables such as supply voltage.
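The trade between multiplier count and throughput can be sketched as follows. This is an illustrative model only: `mac_parallel` and `mac_folded` are hypothetical names, and ordinary Python arithmetic stands in for the simplified multipliers.

```python
def mac_parallel(r_row, s):
    """All products in one cycle: one multiplier per operand, then an adder tree."""
    return sum(r * x for r, x in zip(r_row, s))

def mac_folded(r_row, s, num_multipliers=4):
    """Same sum with only `num_multipliers` multipliers, reused over several cycles."""
    acc = 0
    cycles = 0
    for start in range(0, len(r_row), num_multipliers):
        chunk = zip(r_row[start:start + num_multipliers],
                    s[start:start + num_multipliers])
        acc += sum(r * x for r, x in chunk)   # small adder tree plus accumulator
        cycles += 1
    return acc, cycles
```

With 16 operands and 4 multipliers the folded version needs 4 cycles for the same result, which corresponds to the factor-of-4 throughput reduction mentioned above.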

Since the search process advances one stage per clock cycle, we propose an FIR-like architecture to facilitate metric calculation, as shown in Fig. 8. If only forward trace is allowed, the BER performance is limited by the number of parallel processors, as in the K-best algorithm: even if more processing cycles are available, the K-best algorithm has no way to use them to improve the BER. Observing that a trace-back moves up by only one or two layers rather than jumping to an arbitrary layer, we embed a bidirectional shift-register chain to adjust the search depth. Since the search state is recorded in the shift registers, no extra memory, such as a stack memory, is needed to keep all the states [26] [40]. Because of the trace-back requirement, a transposed-form FIR architecture is not suitable for reducing the critical path; instead, the critical path is reduced by data interleaving.

Fig. 9. Circuit diagram of MCU. [Labels: coefficients Ri,i, Ri,i+1, Ri,i+2, ..., Ri,M; symbols si, si+1, si+2, ..., sM; adder tree.]

The coefficients of the R matrix are stored in memory in an area-efficient way. The diagonal terms of R are real, while the remaining terms are complex. Exploiting the upper-triangular structure, the real-part data (including the diagonal) and the imaginary-part triangular data are organized into one square memory, which saves around 50% of the area.
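One way to picture such a square memory is sketched below. The exact layout is an assumption made for illustration (real parts in the upper triangle including the diagonal, imaginary parts mirrored into the lower triangle), and `pack_r` and `unpack_r` are hypothetical helper names.

```python
def pack_r(r):
    """Pack an upper-triangular complex matrix with a real diagonal into M x M reals.

    Assumed layout: mem[i][j] (j >= i) holds Re(R[i][j]); mem[j][i] (j > i)
    holds Im(R[i][j]). The diagonal needs no imaginary slot because it is real.
    """
    m = len(r)
    mem = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            mem[i][j] = r[i][j].real
            if j > i:
                mem[j][i] = r[i][j].imag
    return mem

def unpack_r(mem):
    """Rebuild the upper-triangular complex matrix from the packed memory."""
    m = len(mem)
    r = [[0j] * m for _ in range(m)]
    for i in range(m):
        r[i][i] = complex(mem[i][i], 0.0)
        for j in range(i + 1, m):
            r[i][j] = complex(mem[i][j], mem[j][i])
    return r
```

Under this layout an M×M matrix occupies M² real words instead of the 2M² words of a full complex array, which is consistent with the roughly 50% saving quoted above.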


2) Metric Enumeration:

The Metric Enumeration Unit (MEU) enumerates the possible constellation points according to their Partial Euclidean Distance (PED), $\sum_{j=i}^{M} |T_j|^2$, in ascending order.

Exhaustive search is a straightforward implementation; it calculates the PEDs of all constellation points and uses a sorting circuit to find the minimum one, as shown in Fig. 10 (a). The number of distance calculation units is proportional to the constellation size (64 units are required for 64-QAM, for example). This requirement in itself makes it hard to support a large constellation size, in addition to the extra latency introduced by the minimum search circuit.

In the constellation plane, metric enumeration is equivalent to finding the scaled constellation points $R_{ii}Q_i$ closest to $b_i$, ordered from the closest to the farthest [2]. This is the underlying principle of the Schnorr-Euchner (SE) algorithm. SE enumeration was originally applied to one-dimensional signals, such as real-valued PAM or PSK constellations; it was therefore modified to arrange QAM constellations in $P_Q$ concentric groups to fit the original algorithm. For example, a 16-QAM constellation can be expressed as an arrangement of points on 3 concentric circles. The problem is then reformulated as finding the closest point within each subgroup and then the closest point across subgroups, as shown in Fig. 10 (b) [24].

Fig. 10. Closest point selection scheme: (a) exhaustive search architecture, (b) SE enumeration for QAM, (c) region partition based search approach. Real values are represented by gray lines.

The original algorithm [2] uses the phase relationship to find the closest point in a concentric circle. This approach is not suitable for hardware implementation, so [24] proposed a decision boundary based method to simplify the SE enumeration. One type of decision boundary is set by straight lines passing through the origin and the middle point between two adjacent constellation points in a concentric circle, to specify the starting point. Another type of decision boundary is set by straight lines passing through the origin and the middle point between two constellation points around the starting point in a concentric circle, to determine the initial search direction. However, this simplification is only applicable to small constellations such as 16-QAM. Larger constellation sizes are hard to support for several reasons. First, the decision boundary algorithm is quite complex: many multiplications are needed to generate the decision boundaries. Second, the number of subgroups grows quickly, which increases the latency of the min-search circuit. For example, 64-QAM is decomposed into 9 subgroups. Third, the concentric group partition does not scale as the QAM constellation size changes, which makes it infeasible for the architecture to support different modulations.

We propose a simple partition method based on Cartesian coordinates. The constellation plane is partitioned into 64 regions for 64-QAM (8 regions along the real axis and 8 regions along the imaginary axis). The closest point (with minimum distance) can be decided from the location of $b_i/R_{ii}$, since the real part and the imaginary part can be decoded separately, as shown in Fig. 10 (c). In fact, the same idea is used in symbol decision. For instance, to make a decision in a QPSK system, we do not need to calculate the distances from the received signal to the 4 constellation points. Instead, we just need to examine the signs of the real and imaginary parts.
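The per-axis decision can be sketched in a few lines. This is a minimal illustration, assuming the usual square QAM grid with odd-integer levels and an input that has already been normalized by $R_{ii}$; `slice_axis` and `closest_point` are hypothetical names.

```python
def slice_axis(x, levels=8):
    """Nearest odd-integer level to x, found by threshold comparisons only.

    For 64-QAM each axis has the levels -7, -5, ..., 7 and the decision
    boundaries sit at -6, -4, ..., 6.
    """
    top = levels - 1                       # 7 for 64-QAM, 3 for 16-QAM, 1 for QPSK
    level = -top
    for boundary in range(-top + 1, top, 2):
        if x > boundary:                   # one comparator per boundary
            level = boundary + 1
    return level

def closest_point(z, levels=8):
    """Closest QAM point to z, deciding the real and imaginary axes independently."""
    return complex(slice_axis(z.real, levels), slice_axis(z.imag, levels))
```

For `levels=2` the loop degenerates to a single comparison against zero per axis, which is the QPSK sign check described in the text.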

The area complexity of the three architectures in Fig. 10 is evaluated using the number of add-equivalent operators (add, subtract, compare) as the area estimate. For a 64-QAM constellation, exhaustive search needs 64 subtractors, 64 square operators, and a min-search circuit. Assuming the square operators are simplified to absolute-value operators with a small performance loss [24] and that the min-search uses a serial comparison circuit, a total of 192 add-equivalent operators are needed. SE enumeration for 64-QAM needs 64 boundary decision comparisons and a min-search across 9 subgroups, so 73 add-equivalent operators are needed, assuming the boundaries are given. The proposed region partition search needs 8 comparators for the real part and 8 for the imaginary part, which is only 16 add-equivalent operators. Therefore, a 4.6× area reduction is achieved compared to SE enumeration for 64-QAM, and a 12× reduction compared to exhaustive search.

A similar approach is used for the delay comparison: the number of adder delays is used as the delay metric. Here, we assume the delay of the min-search circuit is equal to $\log_2 n$, where n is the number of sorting elements. A serial comparison circuit would need n adder delays to finish the comparison, so a more area-consuming parallel architecture must be used to reach this delay. The delay of exhaustive search is approximated by the sum of the delays of 1 adder, 1 absolute-value operator, and $\log_2 64$, which is equal to 8. The delay of SE enumeration is equal to 1 operator plus $\lceil \log_2 9 \rceil$, which is equal to 5. Our design needs only 1 comparator delay, which is 1/5 of the delay of SE enumeration without pipelining.

TABLE III

AREA AND DELAY COMPARISON

                      Exhaustive    SE enumeration    Our work

Area (normalized)     192           73                16

Delay (normalized)    8             5                 1
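The entries of Table III follow directly from the operator counts given in the text. The short script below simply reproduces that arithmetic under the same assumptions (add-equivalent operators for area, adder delays with a $\lceil \log_2 n \rceil$ min-search for delay); the function names are hypothetical.

```python
import math

def exhaustive(points):
    area = points + points + points          # subtract, absolute value, serial min-search
    delay = 1 + 1 + math.ceil(math.log2(points))
    return area, delay

def se_enumeration(points, subgroups):
    area = points + subgroups                # boundary comparisons plus min-search
    delay = 1 + math.ceil(math.log2(subgroups))
    return area, delay

def region_partition(points):
    per_axis = math.isqrt(points)            # 8 regions per axis for 64-QAM
    area = 2 * per_axis                      # comparators, as counted in the text
    delay = 1                                # all comparisons run in parallel
    return area, delay
```

For 64-QAM with 9 subgroups this gives (192, 8), (73, 5), and (16, 1), so the area ratios are 73/16 ≈ 4.6 and 192/16 = 12.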

One challenge in the MEU implementation is that a divider or an inverse operator seems inevitable for calculating $b_i/R_{ii}$, which usually introduces longer latency and higher hardware complexity. The fact that the diagonal element $R_{ii}$ of the R matrix is real simplifies the problem, but still leaves hardware overhead. One possible method is to calculate $R_{ii}^{-1}$ in the preprocessing stage, since these values are updated at the packet rate [16] [21]. If $R_{ii}^{-1}$ is given, only one multiplier is needed. In our approach, we show that this inverse operation is not necessary.

Instead of deciding $b_i R_{ii}^{-1}$ in the constellation plane, it is equivalent to decide $b_i$ in a constellation plane scaled by $R_{ii}$. With the decision boundaries (db) denoted as $db \in \{-6, -4, -2, 0, 2, 4, 6\}$, the $b_i R_{ii}^{-1}$ calculation is replaced by the $db \cdot R_{ii}$ calculation. It may seem that replacing one multiplier with 6 multipliers, in order to execute the boundary comparisons in parallel, is not a good tradeoff from the area standpoint. However, a careful examination reveals that one large multiplier is replaced with small multipliers, and that these small multipliers can be simplified to shift-add operators. Only one adder is needed to implement $db \cdot R_{ii}$ ($6R_{ii} = 4R_{ii} + 2R_{ii}$); the other products are implemented by hard-wired shifting and inversion. The negative values are computed by bit inversion without the carry-in bit, because the missing carry appears as a negligible quantization error from the signal decision perspective. The area reduction is quite high: if the wordlength of $R_{ii}^{-1}$ is L, a multiplication with a large wordlength $[L \times WL(b_i)]$ is replaced with an addition that also has a smaller number of bits, $[L + 3]$. The simple region decision circuit is shown in Fig. 11.


Fig. 11. Region decision circuit. [Labels: inputs Real{bi}/Imag{bi}, Rii, sign, constellation size; comparators against 6Rii, 4Rii, 2Rii, 0, -2Rii, -4Rii, -6Rii; decision levels 7, 5, 3, 1, -1, -3, -5; outputs s[2], s[1], s[0] followed by symbol remapping.]
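The division-free decision can be sketched with integer (fixed-point) arithmetic. This is an illustration under stated assumptions rather than the circuit of Fig. 11: $R_{ii}$ is taken to be a positive integer, negative thresholds use exact negation instead of bit inversion without carry-in, and the function names are hypothetical.

```python
def scaled_boundaries(r_ii):
    """Thresholds db * R_ii for db in {2, 4, 6} using two shifts and one adder."""
    two = r_ii << 1            # hard-wired shift
    four = r_ii << 2           # hard-wired shift
    six = four + two           # the only adder: 6R = 4R + 2R
    return two, four, six

def slice_axis_scaled(b, r_ii):
    """Decide the 64-QAM axis level of b / r_ii without dividing (r_ii > 0)."""
    two, four, six = scaled_boundaries(r_ii)
    # Exact negation is used here; the text instead inverts the bits without
    # a carry-in, which differs by one LSB.
    thresholds = [-six, -four, -two, 0, two, four, six]
    level = -7
    for t in thresholds:
        if b > t:              # one comparator per scaled boundary
            level += 2
    return level
```

Comparing `b` against `db * r_ii` gives the same decision as comparing `b / r_ii` against `db` whenever `r_ii` is positive, which is why neither the divider nor the precomputed inverse is required in this sketch.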

An extra symbol remapping block is inserted at the end to remap the constellation points if a different constellation size is used. Decision outputs are mapped to Gray code directly, without an intermediate 2's complement representation and Gray code transformation. Table IV shows the mapping rules. Although Rii can always be chosen positive to simplify this circuit further, we keep the flexibility of supporting negative values as well in order to relax the QR decomposition processing. With the proposed approach, no sorting is needed and it is easy to extend to a larger constellation size. Additionally, the use of bit-level arithmetic results in only a linear increase in complexity as the constellation size grows exponentially.

TABLE IV

SYMBOL REMAPPING AND DECISION

64-QAM 16-QAM QPSK/BPSK

s[1:0]

00 00 00

11

00

10 01 01 01 01

11 11 11 11

10 10 10 10 10

real imag

s[2] s[1] s[0] s[2] s[1] s[0]

64-QAM (6-bit)

16-QAM (4-bit)

QPSK (2-bit)

BPSK (1-bit)

After the closest point is found, the remaining candidates are also ordered by the distance between bi and the constellation points, in ascending order. The decoded symbol si is used to enumerate the remaining candidates through geometric relationships rather than sorting, in both trace-back and parallel search modes. The complexity of the sphere decoding algorithm is independent of the lattice constellation size [48]; therefore, we can enumerate only the adjacent constellation points instead of the whole constellation plane. We extract 9 points in the constellation plane, as illustrated in Fig. 12. The eight surrounding constellation points have either a 1-bit error (Fig. 12 (a-b)) or 2-bit errors (Fig. 12 (c-d)) if Gray coding is used. The 2nd closest point for each solution set is decided based on the decision boundaries indicated by the dashed lines in Fig. 12 (a), (c). The remaining points are decided by the search direction, which is specified by other decision boundaries, starting from the 2nd point, as shown in Fig. 12 (b), (d). These two kinds of decision boundaries are easy to calculate by sign checks and comparisons of the real and imaginary parts. The search sequence within each group is well defined, but the boundary between the two groups is not easy to calculate. For example, which of the 3rd search points in the two groups, Fig. 12 (b) and (d), is closer cannot be decided by a simple boundary. Therefore, we adopt a mixed method: the two solution sets are compared to find the final enumeration sequence with respect to the central point.

Fig. 12. Candidate enumeration scheme. Decision boundaries are dashed lines in the central region. [Panels (a), (b): 1-bit error subset; panels (c), (d): 2-bit errors subset. Panels (a), (c) show the 2nd closest point; panels (b), (d) show the 3rd to 5th points. Each panel shows the nine Gray-coded points 110110, 110111, 110101, 111101, 111111, 111110, 101110, 101111, 101101, with 111111 at the center, and the positions of bi and Riisi on the real and imaginary axes.]

Fig. 13 shows the overall area reduction for one PE. An overall 20× area reduction is achieved through various signal processing and circuit techniques, from the arithmetic level down to the circuit level. If 16 sub-carriers are processed through data-stream interleaving, the equivalent area reduction would be more than 260 times. So far, we have built a one-PE sphere decoder. To speed up the search and improve the error probability, multiple PEs need to be used to span the search range. A multi-core architecture is proposed to coordinate all the functional blocks in a power- and area-efficient way.


Fig. 13. Summary of area reduction for one PE. [Vertical axis: area. Steps from the initial design: folding, simplified multiplier, memory reduction, wordlength reduction, MEU simplification; annotated values: x8.5, 20%, 5%, 20%, 30%; total 20x reduction.]

D. Multi-Core Architecture

A multiple-PE architecture inherently improves the search speed in proportion to the number of processing elements. Moreover, the search speed is further increased because shorter paths can be found earlier, which prunes the tree more efficiently. In addition, the number of processing elements offers the flexibility to trade performance for area. Virtually all K-best architectures use parallelism to search several branches at the same time [16]-[22], but they do not take advantage of two important features of sphere decoding: radius shrinking and tree pruning.

When search paths run outside the search radius, they should be discarded instead of being continued with a deeper search. Intuitively, we should assign a new search branch within the search radius to the processors whose search paths fall outside the search radius. To maximize the probability of finding the ML estimate, the children of the branch with the smaller Euclidean distance for that level are assigned as the new search candidates. Therefore, the functions needed include: (1) a sorting circuit to record the branch with the minimum Euclidean distance, (2) a radius checking block to examine whether the Euclidean distance is larger than the search radius, and (3) a candidate enumeration circuit, illustrated in Fig. 12. Since the radius checking block is included in the sphere decoder, one of the many algorithms for effective radius shrinking can be utilized [2] [3] [10] [12].

1) Sorting Circuit:

Sorting algorithms are extensively studied in computer science. In hardware, several architectures are widely used: serial sorting, parallel sorting (Batcher sorter), and the Single Instruction Multiple Data (SIMD) architecture [16][33-36]. The serial sorter executes the bubble sorting algorithm [16]; its serial comparisons result in a longer latency. The parallel sorter, which is widely used in packet switching networks, makes use of parallelism to speed up sorting at the cost of increased area. SIMD provides the largest flexibility, but its interconnect network is very complex. A comparison of these architectures is summarized in Table V. For N inputs, $n = \log_2 N$. Latency and area are estimated as the number of comparator delays and the number of comparators, respectively.

TABLE V

SUMMARY OF SORTING CIRCUITS

                      Serial    Parallel         SIMD

Latency               N         n(n+1)/2         n(n+1)/2

Area                  N/2       (n^2+n)N/4       N/2

Routing complexity    Low       Medium           High
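To make the entries of Table V concrete, the formulas can be evaluated for a given number of inputs. The helper below is a direct transcription of the table, with $n = \log_2 N$; the name `sorter_costs` is hypothetical and a power-of-two input count is assumed.

```python
import math

def sorter_costs(num_inputs):
    """Latency (comparator delays) and area (comparators) from Table V."""
    n = int(math.log2(num_inputs))           # assumes a power-of-two input count
    return {
        "serial":   {"latency": num_inputs,       "area": num_inputs // 2},
        "parallel": {"latency": n * (n + 1) // 2, "area": (n * n + n) * num_inputs // 4},
        "simd":     {"latency": n * (n + 1) // 2, "area": num_inputs // 2},
    }
```

For 16 inputs the parallel sorter needs 80 comparators against 8 for the serial sorter, while its latency is 10 comparator delays against 16.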

Area is the first priority in the design of the sorting circuit, because the sorting circuit needs to be replicated to support multiple sub-carriers. Thanks to the data-interleaving operation, N − 1 time slots are available while the other sub-carriers are being processed, which makes serial comparison possible within a symbol period. Therefore, a serial sorter is selected in our design. Since the first input is loaded directly into the register of the first stage, the latency is N − 1 cycles (one cycle is saved). Fig. 14 shows the circuit of the serial sorter. For each comparator, the larger operand is sent to the lower branch and the smaller one is sent to the upper branch. The final sorted Euclidean distances from the PEs can be passed to the outer receiver for iterative decoding.

Fig. 14. Circuit diagram of a serial sorter. [A chain of compare stages, stage 1 to stage M/2, each with H and L outputs.]
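The behaviour of a serial compare chain can be sketched as follows. This is a simplified behavioural model, not the circuit of Fig. 14: it assumes one register per stored value, with each stage keeping the smaller operand and forwarding the larger one, and the name `serial_sort` is hypothetical.

```python
def serial_sort(values):
    """Feed values one per step into a chain of compare-and-keep stages."""
    stages = []                              # stage k ends up holding the k-th smallest
    for v in values:
        moving = v
        for k in range(len(stages)):
            if moving < stages[k]:           # smaller value stays, larger moves on
                stages[k], moving = moving, stages[k]
        stages.append(moving)                # value leaving the chain fills a new stage
    return stages
```

The first value is simply loaded into the first register without a comparison, which mirrors the one-cycle saving noted in the text.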

2) Radius Checking:

Radius checking is executed in parallel with sorting. The Euclidean distances stored in all PEs are checked serially. If the Euclidean distance is larger than the search radius, a new search path is assigned. On the other hand, if the Euclidean distance is smaller than the search radius, then the search radius is updated to this smaller value and the corresponding branch is chosen as the ML estimate.
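The serial check can be summarized in a short sketch. Two points are assumptions that the text does not state: the radius is only shrunk by paths that have reached the bottom of the tree (flagged by the hypothetical `at_leaf` list), and the function name and return format are illustrative.

```python
def radius_check(distances, radius, at_leaf):
    """Scan the PEs serially; return (new_radius, best_pe, pes_to_reassign)."""
    best_pe = None
    reassign = []
    for pe, d in enumerate(distances):
        if d > radius:
            reassign.append(pe)              # path left the sphere: assign a new branch
        elif at_leaf[pe] and d < radius:
            radius = d                       # shrink the sphere
            best_pe = pe                     # current candidate for the ML estimate
    return radius, best_pe, reassign
```

Because the scan is serial, a radius update made for one PE immediately tightens the test applied to the PEs checked after it.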

A multi-core architecture is proposed to coordinate all functional blocks. The number of PEs is decided from the BER-area-power tradeoff. A 16-PE architecture is shown in Fig. 15. For each PE, the decoded symbols and the associated Euclidean distances for 16 sub-carriers are fed into registers serially after processing. In each cycle, the metrics of only one sub-carrier are computed, while the other sub-carriers undergo sorting, radius checking, and candidate enumeration across PEs. A sorting circuit connecting the 16 registers that belong to the same sub-carrier is embedded. Radius checking is conducted serially using a multiplexer and is followed, conditionally, by a new path assignment.

Fig. 15. Multi-PE sphere decoder architecture. [Labels: PE-1 to PE-16; each PE with MCU, MEU, memory, demux, and registers SC-1 to SC-16; sub-carrier space with mux across PE-1 to PE-16; I/O interface; radius checking and updating.]

With this compact multi-PE architecture, the sphere decoder provides very high performance. At 256 MHz, each PE provides 46.5 GOPS (12-bit equivalent add), and the total for the 16-PE architecture amounts to 800 GOPS (including the sorting and radius checking circuits) for the whole system when operating in the 16×16 64-QAM mode. In addition to high performance, flexibility and scalability are also provided. We describe the design specifications next.

E. Design Specifications

The sphere decoder is designed to support different system configurations with respect to antenna array size, modulation and detection schemes, as well as the number of sub-carriers. Table VI summarizes the configuration modes. Since varying antenna array sizes and modulations are supported, this design is also capable of trading off diversity gain for spatial multiplexing gain if STBC is used. Due to interleaving by 16, the supported number of sub-carriers can be any multiple of 16 through data rearrangement.

TABLE VI

OVERVIEW OF SYSTEM CONFIGURATION MODES

Configuration         Modes

Antenna array size    Any square array between 2×2 and 16×16

Modulation            BPSK, QPSK, 16-QAM, 64-QAM

# sub-carriers        16, 32, 64, 128

Detection             Depth-first, K-best


The main design specification is the throughput constraint for the algorithm. Since a total bandwidth of 16 MHz is used, each sub-channel must process data at 1 MS/s in the case of 16 sub-carriers. The requirement is thus to process 16 parallel streams of data at a 1 MHz rate. The clock specification for the resulting architecture then becomes 256 MHz (1 MHz × 16 sub-carriers × 16 antennas). Note that all modes can run at a clock frequency of 256 MHz; the required clock frequency for smaller array sizes is lower because the channel bandwidth is fixed. Detailed system specifications are listed in Table VII for array sizes from 4×4 to 16×16. The system supports an ideal throughput of up to 1.536 Gbps, which corresponds to a spectral efficiency of up to 96 bps/Hz. When the system is operated in a smaller array mode, the clock frequency and supply voltage can be reduced to minimize power consumption.

TABLE VII

SUMMARY OF SYSTEM SPECIFICATION

Antenna array    Modulation    BW (baseline)    Clock freq.    Throughput (bps)    Spectral efficiency (bps/Hz)

4×4              QPSK          16 MHz           64 MHz         128M                8

4×4              16-QAM        16 MHz           64 MHz         256M                16

4×4              64-QAM        16 MHz           64 MHz         384M                24

8×8              QPSK          16 MHz           128 MHz        256M                16

8×8              16-QAM        16 MHz           128 MHz        512M                32

8×8              64-QAM        16 MHz           128 MHz        768M                48

16×16            QPSK          16 MHz           256 MHz        512M                32

16×16            16-QAM        16 MHz           256 MHz        1.024G              64

16×16            64-QAM        16 MHz           256 MHz        1.536G              96
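The figures in Table VII can be reproduced from the stated 16 MHz bandwidth and 1 MHz per-sub-carrier rate. The short script below is only a restatement of that arithmetic; the function names are hypothetical, and the throughput is the ideal uncoded rate (bandwidth × transmit antennas × bits per symbol).

```python
BITS_PER_SYMBOL = {"BPSK": 1, "QPSK": 2, "16-QAM": 4, "64-QAM": 6}

def clock_hz(antennas, subcarriers=16, symbol_rate_hz=1e6):
    """Folded, interleaved PE: one cycle per antenna layer per sub-carrier."""
    return symbol_rate_hz * subcarriers * antennas

def throughput_bps(antennas, modulation, bandwidth_hz=16e6):
    """Ideal throughput: every transmit antenna carries one symbol per Hz."""
    return bandwidth_hz * antennas * BITS_PER_SYMBOL[modulation]

def spectral_efficiency(antennas, modulation):
    """Throughput divided by bandwidth, in bps/Hz."""
    return antennas * BITS_PER_SYMBOL[modulation]
```

For example, the 16×16 64-QAM mode gives a 256 MHz clock, 1.536 Gbps, and 96 bps/Hz, matching the last column of the table.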

A hardware comparison is given in Table VIII. The estimated chip area is 0.55 mm2 in a standard 90 nm CMOS process, using the approximation that 10,000 FPGA slices correspond to roughly 1 mm2 of layout area in 90 nm CMOS [28]. To make a fair comparison, the area is normalized by the number of transmit antennas (this is a conservative estimate, because the hardware complexity could grow quadratically with the number of transmit antennas). The data indicate that the proposed architecture is the most area efficient compared to prior work. Furthermore, our design outperforms all previously published designs in terms of supported antenna array size and constellation size, as shown in Fig. 16. Unlike previous work, the proposed architecture also supports multiple sub-carriers and search methods. Finally, this is the first design that offers the flexibility required to fully traverse the diversity vs. spatial-multiplexing tradeoff curve.

TABLE VIII

HARDWARE COMPLEXITY COMPARISON

                [19]       [17]                [21]           [22]      [24]      This work

Area            500k GC    10 mm2 (0.18um)     12.7 slices    97k GC    50k GC    *

Area (norm.)    6.5        2.5                 9.2            18.2      1.3       1

*154k gate count (GC), 0.55 mm2 (90nm), or 5.5 slices


Fig. 16. Comparison of this work with previous work. [Axes: modulation (BPSK, QPSK, 16QAM, 64QAM) and antenna array size (4x4, 8x8, 16x16); regions shown for this work, for [17][21][22][24], and for [19].]

IV. DESIGN METHODOLOGY

An integrated design methodology is adopted in our work to incorporate algorithm, architecture, and circuit implementation in a highly automated environment. Since the design is complex, we start with a layered design approach that decomposes the whole system hierarchically, from the top-level architecture down to the fundamental modules. Considerations such as area and throughput are evaluated at each layer for architecture optimization. A graphical Simulink/Matlab development environment offers bit-true, cycle-accurate hardware-equivalent modules for simulation, and then translates them to FPGA emulation without hardware description language (HDL) coding. Due to the limited capacity of a single FPGA, the BEE2 platform [38] is used to accommodate the whole system and speed up emulation.

A. Simulink-Based Design Environment

We use the Simulink/Matlab design environment [44]. Traditionally, circuit design for communication signal processing is divided into two stages: algorithm design and circuit implementation. Algorithm designers use C/C++ or Matlab for system simulation, and the resulting architecture is then implemented by circuit designers using HDL. There are usually several iterations between the two design stages to ensure that the final design satisfies the specifications. In this work, Xilinx System Generator (XSG) block-sets are used to build hardware-equivalent modules, which enables cycle-accurate software simulation. In addition, quantization effects due to finite wordlength are considered in the simulation. Since the equivalent HDL can be generated automatically, area information is extracted early in the design, in terms of number of slices by the resource estimator (XSG) or in terms of area by Design Compiler (Synopsys). The drawback of the Simulink-based design flow is its lengthy simulation time, which can be mitigated by FPGA-based hardware emulation [43].


B. Emulation Platform

FPGA-based hardware emulation and rapid prototyping have become an attractive solution, which can provide up to $10^6$ times faster simulation speed than software simulation [37]. A Xilinx University Program (XUP) board (Virtex-2 Pro 30 part) [50] is used to develop the hardware/software co-simulation environment for small circuits. In this case, the hardware modules built in Simulink are replaced with the configured FPGA to speed up simulation. Due to the limited capacity of the XUP board, the BEE2 platform is used for whole-system emulation.

The BEE2 consists of 5 Virtex-2 Pro 70 FPGAs (~10M equivalent logic gates in total). Each FPGA embeds a PowerPC core, which minimizes the latency between the microprocessor and the reconfigurable logic. Four FPGAs (user FPGAs) are used for computation and one for control (control FPGA), as shown in Fig. 17. With high-bandwidth, low-latency links, BEE2 provides a virtual single FPGA of five times the capacity [38].

Fig. 17. BEE2 emulation platform. [Blocks: User FPGA-1, User FPGA-2, User FPGA-3, User FPGA-4, and Ctrl FPGA.]

C. Simulation Results

The BER performance of one PE is verified through the hardware/software co-simulation environment. In this preliminary experiment, the ZF-DFE/BLAST algorithm is adopted, i.e. for each level of the search tree, only the closest lattice point is chosen as the decoded symbol [41] [49]. Since only a small portion of the solution space is examined, there is a performance gap between this scheme and the ML solution. However, we demonstrate that a system with a larger antenna array and repetition coding can easily outperform ML detection with a smaller antenna array. The BER performance can be further improved to achieve ML performance without repetition coding by using multiple PEs; this multi-PE design is in progress.

Fig. 18 (a) shows the BER performance of 64-QAM modulation for different antenna array sizes and different repetition coding rates. Repetition coding here refers to sending replicas in the space domain to reduce the error probability. The BER performance of 4×4, 8×8, and 16×16 is comparable, but the throughput differs for a fixed bandwidth. With repetition coding by a factor of 2, the throughput drops to one half, but the BER performance is improved. Therefore, the throughput of 4×4, of 8×8 with repetition coding by 2, and of 16×16 with repetition coding by 4 is the same, but the BER performance is improved significantly. The performance of the ML estimate is depicted in Fig. 18 (b) as a reference, which is the information theoretic bound for a 4×4 system. An 8×8 system with repetition coding by 2 outperforms the 4×4 system with ML performance by 5 dB.

Fig. 18. Selected BER performance of one PE. [Both panels plot BER (10^-6 to 10^0) versus Eb/No (0 to 25 dB). (a) Curves: 16x16 64QAM; 8x8 64QAM; 4x4 64QAM; 8x8 64QAM, rep. x2; 16x16 64QAM, rep. x4. (b) Curves: 4x4 16QAM; 4x4 16QAM ML; 8x8 16QAM, rep. x2; 16x16 16QAM, rep. x4.]


V. CONCLUSION

This work proposed a flexible sphere decoder architecture for extracting diversity

and spatial multiplexing gains for MIMO communications. The choice of architecture

is evaluated in terms of hardware complexity, throughput requirement, and flexibility.

To keep the sphere decoder with large dimensional signal processing feasible, design

tradeoffs are jointly considered at algorithm, architecture and circuit layers.

Several signal processing and circuit techniques are used to reduce hardware complexity while meeting the high throughput requirement. We start from multiplier simplification by numerical strength reduction and manipulation of the number representation. A folded and interleaved architecture is used to reduce area and shorten the critical path simultaneously. Hardware reuse also provides the flexibility to support varying antenna array sizes. By taking advantage of a bidirectional shift-register chain, the large memory otherwise needed to store search states is avoided and backward trace is supported. Metric enumeration is implemented using an efficient boundary decision algorithm and an associated simplified circuit, which supports multiple modulations at a negligible area cost. Finally, a multi-core architecture is proposed to coordinate all processing elements. This architecture interleaves multiple sub-carrier data streams and searches possible paths in parallel for each sub-carrier. Meanwhile, the processing intervals are used for serial radius checking and updating, creating a very compact realization of the sphere decoding algorithm.

An overall 20 times area reduction is achieved for each processing element. The proposed flexible architecture supports antenna arrays ranging from 2×2 to 16×16, modulations from BPSK to 64-QAM, and 16 to 128 sub-carriers. With an 800 GOPS processing capability, the peak estimated data rate exceeds 1 Gbps over a 20 MHz channel.

VI. FUTURE WORK

The flexible architecture is designed to implement various search algorithms, but the optimal algorithm to fully utilize all PEs is still being developed. Because the BEE2 platform enables rapid prototyping to verify the proposed algorithms, the simulation time can be shortened by several orders of magnitude. Soft information generation is another important function that can be combined into the proposed sphere decoder [31]. The ultimate goal of this work is to develop an ASIC with the optimal architecture in the energy-area-delay space. A sensitivity-based optimization framework will be adopted to balance design variables for blocks in the same abstraction layer of the hierarchy as well as across layers [29]. Starting from the initial realization, the design can be further optimized through wordlength optimization, gate sizing, and VDD optimization. Finally, the fabricated chip will be tested using an efficient FPGA-based ASIC verification flow [6]. The proposed timeline, with accomplished and remaining work, is plotted in Fig. 19.

Fig. 19. Proposed timeline. [Time axis: 09/06, 01/07, 05/07, 09/07, 01/08, 05/08, 09/08, 01/09, 05/09, with a marker at 11/13. Tasks: XUP implementation, BEE2 emulation, algorithm space exploration, architecture space exploration, ASIC implementation, chip testing platform, chip verification, thesis writing.]

REFERENCES

[1] E. Zimmermann, W. Rave, and G. Fettweis, “On the Complexity Decoding,” in Proc. International Symposium on Wireless Personal Communication (WPMC’04), vol. 2, pp. 500–504, Sept. 2004.

[2] B. M. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Trans. Comm., vol. 51, pp. 389–399, Mar. 2003.

[3] G. B. Giannakis, Z. Liu, X. Ma, and S. Zhou, Space-Time Coding for Broadband Wireless Communications, John Wiley & Sons, 2007.

[4] D. Tse and P. Viswanath, Fundamentals of Wireless Communication, Cambridge University Press, 2005.

[5] L. Zheng and D. Tse, “Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels,” IEEE Transactions on Information Theory, vol. 49, no. 5, pp. 1073–1096, May 2003.

[6] D. Markovic, A Power/Area Optimal Approach to VLSI Signal Processing, PhD Thesis, University of California at Berkeley, 2006.

[7] F. Edman, Digital hardware aspects of multiantenna algorithms, PhD Thesis, Lund University, 2006.

[8] A. Maltsev, V. Pestretsov, R. Maslennikov, and A. Khoryaev, “Triangular systolic array with reduced latency for QR-decomposition of complex matrices,” Proc. IEEE International Symposium on Circuits and Systems (ISCAS’06), pp. 385–388, May 2006.


[9] L. M. Davis, ―Scaled and decoupled Cholesky and QR decompositions with application

to spherical MIMO detection, in IEEE Wireless Communications and Networking

(WCNC’03), vol. 1, pp. 326–331, Mar. 2003.

[10] J. Lee, S. Park, Y. Zhang, K. K. Parhi, and S.-C. Park, “Implementation Issues of a List

Sphere Decoder,” in Proc. International Conference on Acoustics, Speech and Signal

Processing (ICASSP’06), vol. 3, pp. 996–999, May 2006.

[11] M. K. Varanasi, “Decision Feedback Multiuser Detection: A Systematic Approach,”

IEEE Transactions on Information Theory, vol. 45, no. 1, pp. 219–240, Jan. 1999.

[12] W. Zhao and G. B. Giannakis, “Sphere decoding algorithms with improved radius

search,” IEEE Transactions on Communications, vol. 53, no. 7, pp. 1104–1109, July

2005.

[13] V. Tarokh, N. Seshadri, and A. R. Calderbank, “Space-Time Codes for High Data Rate

Wireless Communication: Performance Criterion and Code Construction,” IEEE

Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, Mar. 1998.

[14] V. Tarokh, H. Jafarkhani, and A. R. Calderbank, “Space-Time Block Codes from

Orthogonal Designs,” IEEE Transactions on Information Theory, vol. 45, no. 5, pp.

1456–1467, July 1999.

[15] D. Gesbert, M. Shafi, D.-S. Shiu, P. J. Smith, and A. Naguib, “From Theory to

Practice: An Overview of MIMO Space-Time Coded Wireless Systems,” IEEE Journal

on Selected Areas in Communications, vol. 21, no. 3, pp. 281–302, Apr. 2003.

[16] K.-W. Wong, C.-Y. Tsui, R. S.-K. Cheng, and W.-H. Mow, “A VLSI architecture of a

K-best lattice decoding algorithm for MIMO channels,” in Proc. IEEE International

Symposium on Circuits and Systems (ISCAS’02), vol. 3, pp. 273–276, May 2002.

[17] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge, “Silicon complexity for

maximum likelihood MIMO detection using spherical decoding,” IEEE Journal of

Solid-State Circuits, vol. 39, pp. 1544–1552, Sep. 2004.

[18] B. Widdup, G. Woodward, and G. Knagge, “A Highly-Parallel VLSI Architecture for a

List Sphere Detector,” in Proc. Int. Conf. Communications, vol. 5, pp. 2720–2725, June

2004.

[19] G. Knagge, M. Bickerstaff, B. Ninness, S. R. Weller, and G. Woodward, “A VLSI 8x8

MIMO Decoder Engine,” in IEEE Workshop on Signal Processing Systems (SIPS’06), pp.

387–392, Oct. 2006.

[20] G. Knagge, G. Woodward, S. Weller, and B. Ninness, “An Optimised Parallel Tree

Search for Multiuser Detection with VLSI Implementation Strategy,” in Global

Telecommunications Conference (GLOBECOM’04), pp. 2440–2444, Dec. 2004.

[21] L. G. Barbero and J. S. Thompson, “Rapid Prototyping of a Fixed-Throughput Sphere

Decoder for MIMO Systems,” in Proc. Int. Conf. Communications, vol. 7, pp. 3082–3087,

June 2006.

[22] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for

MIMO detection,” IEEE Journal on Selected Areas in Communications, vol. 24, pp.

491–503, Mar. 2006.

[23] K. K. Parhi, VLSI Digital Signal Processing Systems, John Wiley & Sons, 1999.

[24] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, “VLSI

implementation of MIMO detection using the sphere decoding algorithm,” IEEE

Journal of Solid-State Circuits, vol. 40, pp. 1566–1577, July 2005.

[25] L. G. Barbero and J. S. Thompson, “Performance of the complex sphere decoder in

spatially correlated MIMO channels,” IET Communications, vol. 1, pp. 122–130, Feb.

2007.

[26] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press,

1998.

[27] J. B. Anderson and S. Mohan, “Sequential coding algorithms: A survey and cost analysis,”

IEEE Transactions on Communications, vol. COM-32, pp. 169–176, Feb. 1984.

[28] D. Markovic, B. Nikolic, and R. W. Brodersen, “Power and Area Efficient VLSI

Architecture for Communication Signal Processing,” in Proc. Int. Conf.

Communications, Jun. 2006.

[29] D. Markovic, B. Nikolic, and R. W. Brodersen, “Power and Area Minimization for

Multidimensional Signal Processing,” IEEE Journal of Solid-State Circuits, vol. 42, no. 4,

pp. 922–934, Apr. 2007.

[30] C. J. Nicol, P. Larsson, K. Azadet, and J. H. O’Neill, “A Low-Power 128-Tap Digital

Adaptive Equalizer for Broadband Modems,” IEEE Journal of Solid-State Circuits, vol.

32, no. 11, pp. 1777–1789, Nov. 1997.

[31] C. Studer, M. Wenk, A. Burg, and H. Bölcskei, “Soft-Output Sphere Decoding:

Performance and Implementation Aspects,” in Proc. Asilomar Conference on Signals,

Systems and Computers (ACSSC’06), pp. 2071–2076, Oct. 2006.

[32] A. Wiesel, X. Mestre, A. Pages, and J. R. Fonollosa, “Efficient Implementation of Sphere

Demodulation,” in IEEE Workshop on Signal Processing Advances in Wireless

Communications, pp. 36–40, June 2003.

[33] J. B. Anderson and S. Mohan, “Sequential coding algorithms: A survey and cost

analysis,” IEEE Transactions on Communications, vol. COM-32, pp. 169–176, Feb.

1984.

[34] N. K. Sharma, “Modular Design of a large sorting network,” in Third International

Symposium on Parallel Architectures, Algorithms, and Networks, pp. 362–382, Dec.

1997.

[35] P. A. Bengough and S. J. Simmons, “Sorting-based VLSI architecture for the

M-algorithm and T-algorithm trellis decoders,” IEEE Transactions on Communications,

vol. 43, no. 2/3/4, pp. 514–522, 1995.

[36] S. Mohan and A. Sood, “A Multiprocessor Architecture for the (M, L) algorithm suitable

for VLSI implementation,” IEEE Transactions on Communications, vol. COM-34, pp.

1219–1224, Dec. 1986.

[37] C. Chang, K. Kuusilinna, B. Richards, A. Chen, N. Chan, R. W. Brodersen, and B.

Nikolic, “Rapid design and analysis of communication systems using the BEE hardware

emulation environment,” in Proc. IEEE Rapid System Prototyping Workshop, Jun. 2003.

[38] BEE2: Berkeley Emulation Engine 2, [Online]. Available:

http://bwrc.eecs.berkeley.edu/Research/BEE/BEE2/index.htm

[39] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design

Perspective, 2nd ed. Prentice-Hall, 2003.

[40] F. Jelinek, “Fast Sequential Decoding Algorithm Using a Stack,” IBM Journal of

Research and Development, vol. 13, no. 6, pp. 675–685, Nov. 1969.

[41] W.-J. Choi, R. Negi, and J. M. Cioffi, “Combined ML and DFE Decoding for the

V-BLAST System,” in Proc. Int. Conf. Communications, vol. 3, pp. 1243–1248, June

2000.

[42] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, 1993.

[43] H. K.-H. So, A. Tkachenko, and R. Brodersen, “A Unified Hardware/Software Runtime

Environment for FPGA-Based Reconfigurable Computers using BORPH,” in Proc. Int.

Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS’06), pp. 259–264,

Oct. 2006.

[44] W. R. Davis, N. Zhang, K. Camera, D. Markovic, T. Smilkstein, M. J. Ammer, E. Yeo, S.

Augsburger, B. Nikolic, and R. W. Brodersen, “A design environment for high

throughput, low power dedicated signal processing systems,” IEEE Journal of

Solid-State Circuits, vol. 37, no. 3, pp. 420–431, Mar. 2002.

[45] B. Hassibi, “An Efficient Square-root Algorithm for BLAST,” in Proc. Int. Conf.

Acoustics, Speech, and Signal Processing (ICASSP’00), vol. 2, pp. 737–740, June 2000.

[46] L. M. Davis, “Scaled and Decoupled Cholesky and QR Decompositions with Application

to Spherical MIMO Detection,” in Proc. IEEE Wireless Communications and Networking

Conference (WCNC’03), vol. 1, pp. 326–331, Mar. 2003.

[47] D. Pham, K. R. Pattipati, P. K. Willett, and J. Luo, “An Improved Complex Sphere

Decoder for V-BLAST Systems,” IEEE Signal Processing Letters, vol. 11, no. 9, pp.

748–751, Sep. 2004.

[48] M. O. Damen, A. Chkeif, and J.-C. Belfiore, “Lattice codes decoder for space-time

codes,” IEEE Communications Letters, vol. 4, no. 5, pp. 161–163, May 2000.

[49] M. O. Damen, H. El Gamal, and G. Caire, “On Maximum-Likelihood Detection and the

Search for the Closest Lattice Point,” IEEE Transactions on Information Theory, vol. 49,

no. 10, pp. 2389–2402, Oct. 2003.

[50] XUP: Xilinx University Program, [Online]. Available: http://www.xilinx.com/univ
