A Flexible VLSI Architecture for Extracting Diversity and
Spatial Multiplexing Gains in MIMO Channels
Chia-Hsiang Yang
University of California, Los Angeles
Challenges:
1. A unified solution to span the entire diversity-multiplexing tradeoff curve
2. Tradeoff between two search methods
Depth-first: ML performance with variable throughput and long latency
K-best: near ML performance with constant throughput and short latency
3. Antenna array size beyond 4×4
Area increases quadratically with the number of transmit antennas
Critical path increases linearly with the number of transmit antennas
4. Modulations beyond 16-QAM
Hardware increases quickly with the constellation size
Longer latency introduced by the minimum search circuit
5. Multiple sub-carriers
Research Contributions:
1. A unified sphere decoder architecture for extracting diversity and spatial
multiplexing gains in MIMO channels
2. Signal processing techniques to support antenna sizes up to 16×16
Folding: hardware area increases linearly with antenna array size
Loop retiming: reduces the critical path
Data interleaving: supports multiple independent sub-carriers
A region partition enumeration method for constellations up to 64-QAM
3. A flexible architecture
Antenna array: 2×2 to 16×16
Modulations: BPSK to 64-QAM
Number of sub-carriers: 16 to 128
Search method: K-best or depth-first search
4. A simplified multiplier
Numerical strength reduction
Gray coding to reduce number of operations
5. A multi-core architecture for enhanced performance
Abstract—Sphere decoding algorithm is widely used in MIMO communications, because of
its ability to approach maximum likelihood detection with significantly reduced
computational complexity. This makes it attractive for hardware implementation; however,
prior work focused only on solutions with a fixed number of antennas or fixed modulations.
This work presents a unified sphere decoder architecture that exploits the diversity-multiplexing
tradeoff in MIMO channels by taking advantage of the flexibility in the number of antennas
and modulation schemes. Several signal processing and circuit techniques are combined to
reduce the hardware complexity: a 20 times area reduction is achieved compared to a
direct-mapped architecture, even without interleaving of subcarriers. The proposed
flexible architecture supports antenna arrays from 2×2 to 16×16, modulations from BPSK to
64-QAM, over 16 to 128 sub-carriers. The peak estimated data rate exceeds 1.5 Gbps ideal
throughput using a 16 MHz bandwidth in just 0.55 mm² in a standard 90 nm CMOS process.
I. INTRODUCTION
Multi-input multi-output (MIMO) communication has recently received
significant attention due to its potential to increase link robustness and channel
capacity. Hardware realization of MIMO signal processing algorithms is quite
challenging, because it requires multi-dimensional, matrix-based, computations.
However, with the growing demand for higher data transmission rates over wireless
links, the need for devices equipped with multiple antennas is increasing.
Among various MIMO algorithms, sphere decoding is one of the most promising
solutions. It approximates the information theoretic bound, set by the maximum
likelihood (ML) detection, with several orders of magnitude lower computational
complexity [1] [2]. This means that, for a given hardware cost, the reduced
complexity could be utilized to increase the size of antenna array and effectively
improve the performance beyond the ML performance of a system with smaller array
size. The complexity reduction is achieved by transforming an exhaustive search of
the ML decoders into a tree search procedure of sphere decoding. Tree search is quite
popular in other communications areas such as multi-user detection (MUD) for
CDMA systems, block-based demodulation, and linear block error control code
decoding [3]. Other potential applications include speech recognition, data
compression, protein sequence exploration, and neural signal detection.
Sphere decoding algorithm is a multi-dimensional signal processing task dealing
with vector and matrix arithmetic. The required computation involves hundreds of add
and multiply operations, and may also need divide and trigonometric functions. Such
a high complexity limits the system specifications such as antenna array size and
modulations. In addition, prior work focused only on solutions with fixed number of
antennas or fixed modulations [16][17][19][21][22][24]. In this work, we evaluate the
architectures proposed in prior work and advance state-of-the-art in the area of
multidimensional matrix-based signal processing hardware. A number of signal
processing techniques [23] are considered jointly with the technology parameters to
greatly reduce hardware area (cost) and power while maximizing the performance.
This work develops an architecture that further simplifies sphere decoding
implementation by jointly considering tradeoffs at the algorithm, architecture, and
circuit layers of abstraction, with the goal of minimizing chip power and area. At the
same time, additional degrees of freedom are considered in the design in order to take
full advantage of the diversity and spatial multiplexing gains available in MIMO
wireless channels [5]. Tuning over a range of diversity-multiplexing points is possible
by varying antenna array size and modulation scheme, for example. Flexibility and
scalability are, thus, key additional requirements in the design of multi-mode,
multi-standard systems. Also, our work uses the Matlab/Simulink framework to
improve design productivity in mapping of DSP algorithms onto silicon. BEE2
platform [38] is used to verify system functionality before entering physical ASIC
design.
This proposal is organized as follows. Section II reviews the fundamental
diversity-multiplexing tradeoff in MIMO communications and describes sphere
decoding algorithm. Several signal processing techniques, evaluated in
power-area-performance space, and architecture details are presented in Section III.
Section IV describes the Simulink design environment and BEE2 emulation platform.
Conclusions are summed up in Section V. Finally, Section VI proposes future work
and the timeline.
II. ALGORITHM SPACE EXPLORATION
A MIMO system can improve the reliability of a wireless link through increased
diversity or improve the channel capacity through spatial multiplexing. Diversity gain
and spatial multiplexing gain are related to system coverage range and data rate,
respectively. Both gains can be improved using a larger antenna array. However,
given a MIMO system, there is a fundamental trade-off between these two gains [4]
[5]. In the diversity-multiplexing space, repetition code, Alamouti code, and
space-time code use data redundancy to increase diversity at the price of losing spatial
multiplexing gain. In contrast, Bell Labs Layered Space Time (BLAST) algorithm,
Singular Value Decomposition (SVD), and QR decomposition allocate data-streams
in different eigen-modes to maximize spatial multiplexing gain while sacrificing
diversity gain, as shown in Fig. 1.
Sphere decoding is a decoding scheme that can extract both diversity and
multiplexing gains. With flexibility in coding and modulation, sphere decoder can
effectively explore the entire tradeoff curve as shown in Fig. 1. The original data type
for sphere decoding is uncoded data. By manipulation of input data, sphere decoding
is capable of decoding space-time block codes (STBC), which improves the error
probability and increases diversity gain. The data rate can be maximized by
transmitting different modulations over different MIMO substreams to increase
spatial multiplexing gain. Also, with proper preprocessing, the decoding process starts
from decoding the symbols with highest SNR first, and then canceling the effect of
the decoded symbols for remaining symbols until the final symbol is decoded. This
decoding sequence is equivalent to that in BLAST [41]. A unified sphere decoder
model is illustrated in the following section.
[Figure: diversity gain (range) versus spatial multiplexing gain (rate). Labeled schemes: Repetition, Alamouti, Space-time, BLAST, SVD, QR. Sphere decoding spans the tradeoff curve; the curves are annotated with array size.]
Fig. 1. Diversity-Multiplexing tradeoff in MIMO communications.
A. Sphere Decoding Algorithm
Consider a multiple antenna system with M transmit antennas and N receive
antennas. The received vector y can be represented by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (1)$$

where y is an N×1 vector of received symbols, and H denotes an N×M channel matrix
whose elements are i.i.d. complex Gaussian with zero mean and unit variance. Vectors
s and n (M×1 and N×1, respectively) represent the transmitted symbols and zero-mean,
circularly symmetric white Gaussian noise. The transmitted vector
$\mathbf{s} \in Q^M$ with the smallest Euclidean distance is selected as the ML estimate in (2). The
channel matrix can be decomposed further using QR factorization; the equivalent ML
estimate thus can be written as

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 = \arg\min_{\mathbf{s}} \|\hat{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2 \qquad (2)$$

with $\hat{\mathbf{y}} = \mathbf{Q}^H\mathbf{y} = \mathbf{R}\mathbf{s}_{ZF}$,
where $\mathbf{Q}$ is a unitary matrix, $\mathbf{R}$ is an upper triangular matrix, and
$\mathbf{s}_{ZF} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{y}$
is the zero-forcing (unconstrained ML) estimate. The signal model is presented in Fig. 2.
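[Figure: TX sends s through channel H with additive noise n to produce y at RX; with H = QR, the receiver applies Q^H to obtain ŷ, and the sphere decoder computes ŝ = arg min ||ŷ − Rs||².]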
Fig. 2. Signal model of sphere decoding algorithm.
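The equivalence of the two metrics in (2) can be checked in two lines. The derivation below is a sketch under the added assumptions that N = M and that H has full rank, so that Q is square unitary and R is invertible; these assumptions are not stated in the text above.

```latex
\begin{aligned}
\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2
  &= \|\mathbf{Q}^H(\mathbf{y} - \mathbf{Q}\mathbf{R}\mathbf{s})\|^2
   = \|\hat{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2, \\
\mathbf{R}\,\mathbf{s}_{ZF}
  &= \mathbf{R}\,(\mathbf{R}^H\mathbf{Q}^H\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^H\mathbf{Q}^H\mathbf{y}
   = \mathbf{R}\,\mathbf{R}^{-1}\mathbf{R}^{-H}\mathbf{R}^H\mathbf{Q}^H\mathbf{y}
   = \mathbf{Q}^H\mathbf{y} = \hat{\mathbf{y}}.
\end{aligned}
```

The first line uses the fact that multiplying by a unitary matrix preserves the Euclidean norm; the second confirms the relation between ŷ and the zero-forcing estimate.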
The most commonly used methods for QR decomposition are Gram-Schmidt
orthogonalization, Householder transformations, and Givens rotations [7]. Several
modifications, such as division-free or square-root- and division-reduced methods, have
been proposed to simplify the operations in the original algorithm [45] [46]. For hardware
realization, [8] proposed an algorithm suitable for fixed-point implementation and [9]
proposed a CORDIC-based triangular systolic array architecture to reduce latency.
Under the assumption of a block fading channel, QR decomposition is computed at the
packet rate.
Using the upper triangular nature of R, the symbol decoding begins from the last
row and occurs in several steps. The decoded symbols are used for successive
decoding steps until all symbols are decoded. This decoding algorithm can be mapped
to finding a shortest path (with minimum Euclidean distance) in a tree topology – one
possible constellation point denotes one node, each row of the R matrix is mapped to
each level of the tree whose edges are weighted by channel coefficients. The whole
solution space of this tree is equivalent to exhaustive search in the trellis diagram of
the original problem; the total number of combinations of transmitted symbols is $|Q|^M$,
where $|Q|$ is the constellation size.
By properly choosing a search radius and a search method, the ML solution can be
approached by visiting only nodes within a hyper-sphere, rather than performing an
exhaustive search. This complexity reduction is feasible, because the Euclidean
distance is a cumulative sum of square terms. This means that for each node, if its
Euclidean distance is larger than the search radius, the corresponding branches are
outside the search radius as well. The conceptual view of sphere decoding algorithm
is illustrated in Fig. 3. Tree pruning technique makes sphere decoding achieve ML
performance with polynomial complexity (highlighted nodes in Fig. 3) rather than
exponential complexity (all nodes in Fig. 3) [1].
[Figure: search tree with one level per antenna (ant-1, ant-2, ..., ant-M), branching factor equal to the constellation size, and the search radius marked.]
Fig. 3. Concept of sphere decoding. Unlikely nodes and branches are indicated with gray shade.
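The pruning idea can be illustrated with a small software model. The following is a minimal sketch, not the hardware architecture of this work: it assumes a real-valued system with an upper-triangular R, and the function names (`depth_first_sd`, `exhaustive`) and the example data are hypothetical.

```python
from itertools import product

def depth_first_sd(R, y_hat, symbols, radius_sq=float("inf")):
    """Depth-first tree search for argmin ||y_hat - R s||^2 with radius shrinking."""
    M = len(R)
    best = {"dist": radius_sq, "s": None, "visited": 0}
    s = [0] * M

    def search(level, ped):
        # b_i removes the contribution of the symbols already fixed (j > i).
        b = y_hat[level] - sum(R[level][j] * s[j] for j in range(level + 1, M))
        # Try candidates closest-first, so the first leaf found is a good one.
        for cand in sorted(symbols, key=lambda c: abs(b - R[level][level] * c)):
            new_ped = ped + (b - R[level][level] * cand) ** 2
            if new_ped >= best["dist"]:
                break  # this and all farther candidates lie outside the sphere
            best["visited"] += 1
            s[level] = cand
            if level == 0:
                best["dist"], best["s"] = new_ped, list(s)  # shrink the radius
            else:
                search(level - 1, new_ped)

    search(M - 1, 0.0)
    return best["s"], best["dist"], best["visited"]

def exhaustive(R, y_hat, symbols):
    """Reference ML search over all |Q|^M symbol vectors."""
    M = len(R)
    def dist(s):
        return sum((y_hat[i] - sum(R[i][j] * s[j] for j in range(i, M))) ** 2
                   for i in range(M))
    return min(product(symbols, repeat=M), key=dist)

# Hypothetical 4x4 real example: y_hat = R s_true + small noise.
R_EX = [[2.0, 0.5, -0.3, 0.1],
        [0.0, 1.5, 0.4, -0.2],
        [0.0, 0.0, 1.8, 0.6],
        [0.0, 0.0, 0.0, 1.2]]
SYMBOLS_EX = [-3, -1, 1, 3]
Y_HAT_EX = [-0.4, -3.3, 4.95, -1.15]
```

Because the partial Euclidean distance only grows along a path, a node that already exceeds the current radius can be dropped together with its whole subtree, which is why the search visits far fewer nodes than the 4 + 16 + 64 + 256 in the full tree of this example.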
B. Performance Improvements
Several simple yet effective methods such as detection ordering, candidate
enumeration and search radius setting are applied to improve error performance
and/or reduce the complexity of the basic sphere decoding [3]. For instance, the sphere
decoding algorithm for a 4×4 64-QAM system, as compared to exhaustive search, results
in over $10^5$ times reduction in computational complexity [10].
1) Detection Ordering: The idea behind detection ordering is to detect symbols with
the largest SNR first: to avoid discarding the ML solution, the first decoded symbols
should be the most reliable. Various ordering algorithms have been proposed for the
preprocessing stage: V-BLAST-ZF ordering, V-BLAST-MMSE, and Norm ordering [3]
[25]. Assuming a packet-based wireless communication system, the ordering only
needs to be performed once at the beginning of each received frame.
2) Candidate Enumeration: Detection ordering is applied across levels in the tree
topology. For each level, the order of constellation point enumeration is another
important factor to improve search speed. Schnorr-Euchner (SE) enumeration
suggests traversing the constellation candidates according to the cumulative distance
increment in an ascending order [2]. Therefore, the first candidate $s_i$ for each row is
the one with minimum distance between $b_i$ and $R_{ii}Q_i$ in (3). Finding a good admissible
solution early means that we can shrink our initial radius early.

$$\hat{y}_i - \sum_{j=i}^{M} R_{ij}s_j = b_i - R_{ii}s_i \quad \text{with} \quad b_i = \hat{y}_i - \sum_{j=i+1}^{M} R_{ij}s_j. \qquad (3)$$
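For a real-valued constellation, the Schnorr-Euchner order can be produced without a full sort by starting at the level nearest to $b_i / R_{ii}$ and zigzagging outward. The sketch below assumes a real PAM alphabet given as a sorted list and a positive $R_{ii}$; `se_order` is a hypothetical helper name, not part of the original design.

```python
def se_order(b, r_ii, levels):
    """Return constellation levels ordered by increasing |b - r_ii * level|.

    Assumes `levels` is sorted ascending and r_ii > 0.
    """
    target = b / r_ii
    # Nearest level to the unconstrained solution b / r_ii.
    start = min(range(len(levels)), key=lambda k: abs(levels[k] - target))
    order = [levels[start]]
    lo, hi = start - 1, start + 1
    # Two-pointer walk: always take the closer of the two remaining neighbors.
    while lo >= 0 or hi < len(levels):
        take_lo = hi >= len(levels) or (
            lo >= 0 and abs(levels[lo] - target) <= abs(levels[hi] - target))
        if take_lo:
            order.append(levels[lo])
            lo -= 1
        else:
            order.append(levels[hi])
            hi += 1
    return order
```

With `b = 0.9`, `r_ii = 1.5`, and levels `[-3, -1, 1, 3]`, the unconstrained point is 0.6, so the order is `[1, -1, 3, -3]`.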
3) Search Radius Setting: One major feature of sphere decoding is the radius
shrinking. Once a solution is found with a smaller Euclidean distance, the search
radius is updated to this value so that more unlikely branches can be pruned. However,
the initial choice of search radius is not easy for sphere decoding, because the choice
of search radius influences the complexity of the algorithm. When search radius is too
large, a very high number of visited nodes is in the solution space which causes high
detection complexity. Conversely, when the search radius is too small, this may result
in an empty sphere and no available solution.
Based on the AWGN model, the sum of squared noise terms follows a central chi-square
distribution with 2M degrees of freedom [11] [47]. Given the channel SNR, the search radius can
be chosen from this distribution so that the sphere contains the transmitted vector with a
specified confidence level. If the channel SNR is unknown, the Euclidean distance of the
zero-forcing solution can be used as an initial guess. An algorithm with increasing search
radius has also been proposed: it starts the search with a tight radius and expands the
radius if no solution is available within it [12] [48].
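As an illustration of the chi-square-based choice, the sketch below computes a squared radius for a given confidence level. It relies on assumptions not stated in the text: each complex noise entry has total variance `noise_var` split evenly between real and imaginary parts, so that $\|n\|^2 / (\text{noise\_var}/2)$ is chi-square with 2M degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_cdf_even(x, dof):
    """CDF of a central chi-square variable with an even number of degrees of freedom.

    Uses the closed form 1 - exp(-x/2) * sum_{n<dof/2} (x/2)^n / n!.
    """
    k = dof // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for n in range(1, k):
        term *= half / n
        total += term
    return 1.0 - math.exp(-half) * total

def initial_radius_sq(M, noise_var, confidence=0.99):
    """Squared radius r^2 with P(||n||^2 <= r^2) = confidence (bisection on the CDF)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, 2 * M) < confidence:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, 2 * M) < confidence:
            lo = mid
        else:
            hi = mid
    # Scale the normalized quantile back by the per-dimension variance.
    return 0.5 * noise_var * hi
```

A higher confidence level or a larger M gives a larger radius, which lowers the risk of an empty sphere at the cost of more visited nodes.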
C. Tradeoff in Diversity-Multiplexing Space
A unified sphere decoder architecture is illustrated here for extracting diversity
and spatial multiplexing gains along the tradeoff curve. We demonstrate that
flexibility in antenna array size and modulation is the key feature for this
purpose. Antenna array size provides an added flexibility to shift the tradeoff curve in
the diversity-multiplexing space.
In order to maximize diversity gain, we have to supply to the receiver multiple
independently faded replicas of the same symbol, so that the error probability is
reduced [13] [14]. The data replicas can be sent in space and/or time directions. Since
a unified signal model can be developed for these space-time (ST) coding schemes,
the same sphere decoder architecture can be used with some data rearrangement.
Sphere decoding supporting algebraic ST codes [48], linear dispersion code [49], and
space time block code (STBC) [15] were reported in prior work. The ML estimate can
be written as
$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{B}\mathbf{s}\|^2 \qquad (4)$$
where matrix B depends on code generators and channel matrix. By interpreting B as
H in the original signal model, sphere decoding algorithm can be applied. Since the
matrix dimension we deal with is changed due to the data rearrangement in the
preprocessing stage, the equivalent antenna array size will be changed accordingly.
For example, repetition coding by 2 in the space domain for an 8×8 system will be
transformed into data processing in a 4×4 system (only one half of the symbols need to be
decoded). This requirement enhances the need for flexibility in antenna array size.
Spatial multiplexing gain is characterized by data rate. To maximize spatial
multiplexing gain, we should allow data rate to scale with the SNR or assign different
data rate to different substreams for a fixed SNR [5][15]. To this end, modulation
scheme should be adaptive according to channel condition: a larger constellation is
applied to substreams with higher SNR, and a smaller constellation is applied to
substreams with lower SNR. In principle, this transmission strategy just uses
water-filling in space domain. The system performance perspective, therefore, further
motivates the need for adding flexibility in modulation schemes.
III. ARCHITECTURE SPACE EXPLORATION
The optimal architecture is decided by jointly considering tradeoffs at the
algorithm, architecture, and circuit layers of abstraction, with the goal of minimizing
chip power and area. As shown in Fig. 4, a layered design approach is adopted to
merge algorithm and circuit decisions. An efficient multiplier is proposed to reduce
area and delay at the same time. Saving in area directly translates to power reduction
since power spent in charging/discharging parasitic capacitances is also reduced. At
the processing element (PE) architecture level, we evaluate the existing architectures
[16][17][19][21][22][24] and propose a solution with improved area and throughput.
Unlike prior work, flexibility is also considered in the design stage. Antenna size,
modulation scheme, number of subcarriers, and search method are designed with
flexibility and scalability to cover multiple communication scenarios. A multi-core
architecture which consists of many PEs ("small cores") is developed to support the
tradeoff between range and data rate at the system architecture level. We finally
summarize the flexibility, scalability, and system specification.
[Figure: design layers shown are system architecture, PE architecture, metric calculation, and multiplier.]
Fig. 4. Illustration of layered design approach.
A. Numerical Strength Reduction
From an algorithm perspective, the complexity of sphere decoding is evaluated by
the number of nodes visited in the tree search process. When considered for hardware
implementation, decoding algorithms are generally compared in terms of the number
of multiplications. Down to the circuit level, the size of multipliers is the key factor to
estimating the area, speed, and power of the sphere decoder.
We start with simplifying the cost of the multiply operation to reduce hardware
complexity. The multiplication is required to calculate Euclidean distance, which is
mathematically represented by two equivalent forms, Eqs. (5), (6).
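$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s}} \|\mathbf{R}(\mathbf{s}_{ZF} - \mathbf{s})\|^2 \qquad (5)$$

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s}} \|\mathbf{Q}^H\mathbf{y} - \mathbf{R}\mathbf{s}\|^2 \qquad (6)$$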
Seemingly, the number of multiplications in Eq. (5) is less than in Eq. (6): one
multiplication for Eq. (5) and two multiplications for Eq. (6). Hence, Eq. (5) was most
commonly used in prior work [16]-[21] as a baseline for implementation. However, a
careful investigation shows that Eq. (6) is a better choice from hardware perspective
for at least two reasons. First, we observe that $s_{ZF}$ and $Q^H y$ can be pre-computed and,
hence, have negligible impact on the total number of operations. Also, the computation
effort of $s_{ZF}$ is not less than that of $Q^H y$. Second, the wordlength of s is usually much
shorter than that of $s_{ZF}$. Separating terms as in Eq. (6) results in multipliers with reduced wordlength.
Without loss of generality, the normalized size of a multiplier can be estimated by
the product of wordlength of the multiplier and multiplicand. The normalized delay of
a multiplier can be estimated by the sum of wordlength of the multiplier and
multiplicand if an array multiplier is used [39]. The array multiplier approximation
works well for first-order comparison purposes. Table 1 summarizes the relative area
and delay reduction of a multiplier due to numerical strength reduction in a 64-QAM
system, where wordlength (WL) of s is 3 for a real multiplier. We see that the area
reduction is at least 50%, and that the delay reduction also reaches 50% for large
wordlength inputs. The absolute area difference between these two types of
multipliers is amplified by the total number of multiplications in the entire decoding
process, which is approximately $O(M^3)$.
TABLE I
AREA AND DELAY REDUCTION DUE TO NUMERICAL STRENGTH REDUCTION
WL of sZF                    6          8          10         12
WL of R = 12, area/delay     0.5/0.83   0.38/0.75  0.3/0.68   0.25/0.63
WL of R = 16, area/delay     0.5/0.68   0.38/0.63  0.3/0.58   0.25/0.54
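The first-order model described above (area proportional to the product of the two wordlengths, array-multiplier delay proportional to their sum) can be written down directly. The sketch below is only an illustration of that model with hypothetical function names; it is checked here against the WL of R = 12 row of Table I.

```python
def area_ratio(wl_s, wl_szf):
    """Area of an R-by-s multiplier relative to an R-by-s_ZF multiplier.

    Area ~ WL(R) * WL(operand), so WL(R) cancels in the ratio.
    """
    return wl_s / wl_szf

def delay_ratio(wl_r, wl_s, wl_szf):
    """Array-multiplier delay ratio: delay ~ WL(R) + WL(operand)."""
    return (wl_r + wl_s) / (wl_r + wl_szf)

# 64-QAM: wordlength of s is 3 for a real multiplier.
WL_S = 3
for wl_szf in (6, 8, 10, 12):
    print(wl_szf, area_ratio(WL_S, wl_szf), delay_ratio(12, WL_S, wl_szf))
```

The area ratio never exceeds 0.5 for the wordlengths listed, which matches the statement that the area reduction is at least 50%.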
The multiplier can be simplified further by taking advantage of some
characteristics of communication signal processing: Gray coding and quantization
effects. Gray code is a more compact representation in the constellation plane since
only odd numbers are used. Conventionally, the number is transformed to 2’s
complement representation for the purpose of arithmetic operations. Carefully
examining the Gray code representation, the corresponding multiplication can be
implemented by simple shift, add and invert operations. The code mapping, the
associated operations, and the simplified multiplier are shown in Fig. 5. The neg
operator stands for bit-inversion. 1-bit carry-in in 2’s complement can be absorbed as
a carry-in (shaded in gray) in the following adders or simply be discarded as a
quantization error on LSB, which can be recovered by wordlength optimization.
The shifter has no direct area cost apart from routing, while the cost of inverters
and multiplexers is relatively low because they are simple operations. Overall, it is
possible to implement one complex multiplier with 6 adders + inverters and
multiplexers, resulting in a total 40% area reduction compared to traditional approach
(area is estimated by Synopsys Design Compiler). This implementation does not
imply that we have to force the use of Gray coding in the constellation plane; the Gray
coding is only used inside the sphere decoder to simplify metric calculation and
candidate enumeration. The decoded symbols can be converted into any arithmetic
representation at the sphere decoder outputs.
Gray code   000  001  011  010  110  111  101  100
value        -7   -5   -3   -1    1    3    5    7
operation    -7   -5   -3   -1    1   4-1  4+1  8-1
[Figure: simplified multiplier for input R, built from shifts (<<1, <<2), bit inversion (neg), and multiplexers selected by the Gray-code bits S2, S1, S0.]
Fig. 5. Gray code representation and the simplified multiplier.
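The shift-and-add idea can be modeled in software as follows. This is a behavioral sketch, not the circuit of Fig. 5: the bit-to-operation decoding is inferred from the code table above, and negative results use exact negation rather than the bit inversion with carry-in described in the text. The names `GRAY_TO_VALUE` and `gray_multiply` are hypothetical.

```python
# Gray code (S2 S1 S0) to constellation value, as listed in the table above.
GRAY_TO_VALUE = {0b000: -7, 0b001: -5, 0b011: -3, 0b010: -1,
                 0b110: 1, 0b111: 3, 0b101: 5, 0b100: 7}

def gray_multiply(r, code):
    """Multiply integer r by the Gray-coded symbol using only shifts and adds."""
    s2, s1, s0 = (code >> 2) & 1, (code >> 1) & 1, code & 1
    if s1 and not s0:
        mag = r                 # magnitude 1
    elif s1 and s0:
        mag = (r << 2) - r      # 3 = 4 - 1
    elif s0:
        mag = (r << 2) + r      # 5 = 4 + 1
    else:
        mag = (r << 3) - r      # 7 = 8 - 1
    # In this mapping S2 = 1 marks the positive half of the constellation.
    return mag if s2 else -mag
```

In this mapping the magnitude depends only on S1 and S0, and S2 selects the sign, so each product needs at most one shift, one add or subtract, and an optional negation.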
B. Architecture Tradeoff
In the prior work, two major types of tree search methods are reported: depth-first
(DF) [23] [24] and K-best [16]-[22]. The depth-first algorithm starts the search from
the root of the tree and explores as far as possible along each branch, then it
back-traces until a leaf node is found. The K-best algorithm approximates a
breadth-first search by keeping only K branches with the smallest partial Euclidean
distance (PED) at each level [26], which is similar to the M-algorithm in trellis
decoding [27]. The major advantages of DF are that the ML performance can be
achieved, and that radius shrinking can be used for tree pruning. On the other hand,
the advantages of K-best are its uniform data path and constant throughput.
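The K-best procedure can be summarized in a few lines of software. The sketch below is a behavioral model under simplifying assumptions (real-valued symbols, upper-triangular R, full sort at each level); it does not reflect the hardware organization, and `k_best` is a hypothetical name.

```python
def k_best(R, y_hat, symbols, K):
    """Breadth-first search keeping the K partial paths with smallest PED per level."""
    M = len(R)
    # Each survivor is (PED, tail), where tail holds s[level+1], ..., s[M-1].
    survivors = [(0.0, ())]
    for level in range(M - 1, -1, -1):
        children = []
        for ped, tail in survivors:
            # Remove the contribution of the symbols already decided on this path.
            b = y_hat[level] - sum(R[level][level + 1 + k] * tail[k]
                                   for k in range(len(tail)))
            for cand in symbols:
                inc = (b - R[level][level] * cand) ** 2
                children.append((ped + inc, (cand,) + tail))
        children.sort(key=lambda c: c[0])
        survivors = children[:K]   # fixed work per level: constant throughput
    best_ped, best_s = survivors[0]
    return list(best_s), best_ped
```

Every level expands at most K·|Q| children regardless of the channel, which is the source of the constant throughput; the price is that the ML path can be discarded if its partial distance is temporarily not among the K smallest.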
Further examining details, depth-first ensures the ML performance if complete
solution space is explored. This might not be feasible in practice, however, because of
limited buffer size and processing cycles. This means that some termination schemes
should be used and thus ML performance is no longer guaranteed. Since the default
input is uncoded data, achieving a sub-optimal performance while keeping constant
throughput is more important. Then, space-time codes or error correction codes can be
used to improve the performance. The iterative decoding scheme which combines
MIMO decoder and error correction code decoder was proven to achieve
near-capacity performance [2].
In hardware implementation, depth-first is realized in a folding-like architecture
because only one node is visited at a time during the tree search process. In this case,
an extra memory to record the visited nodes is required, for the trace-back operation.
K-best is realized in a multi-stage pipelined way, because no trace-back is needed. To
process K data paths at the same time, parallel architecture is applied. Figure 6
illustrates the basic architectures of these two search schemes, and Table 2
summarizes their comparison in terms of circuit metrics and algorithmic performance.
[Figure: (a) depth-first (folding): a single PE; (b) K-best (parallel and multi-stage): PE 1, PE 2, ..., PE M.]
Fig. 6. Basic architecture of (a) depth-first and (b) K-Best algorithm.
TABLE II
COMPARISON OF DEPTH-FIRST AND K-BEST ALGORITHM
              Area    Throughput   Latency   Radius shrinking / tree pruning   Performance
Depth-first   Small   Variable     Long      Yes                               ML
K-best        Large   Constant     Short     No                                Near-ML
For the sphere decoder operating with a large antenna array, the biggest challenge
in the implementation is reducing area of the design. Using the number of (complex)
multipliers as a first order area estimate, the number of multipliers needed in the
folding and multi-stage architectures are M and M(M+1)/2, respectively, where M is
the number of transmit antennas. Expanding a 4×4 system to a 16×16 system, relative
area increases from 4 to 16 for the folding architecture and from 10 to 136 for the
multi-stage architecture. The folding architecture is 2.5× to 8.5× more area efficient
compared to the multi-stage architecture, as shown in Fig. 7 (a). To keep the area
within a reasonable value, folding technique is considered. The second design
challenge is operating frequency for the folded architecture.
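The multiplier counts quoted above follow directly from the two formulas. The short sketch below reproduces them; the function names are hypothetical.

```python
def folding_multipliers(M):
    """Complex multipliers in the folding architecture: M."""
    return M

def multistage_multipliers(M):
    """Complex multipliers in the multi-stage architecture: M(M+1)/2."""
    return M * (M + 1) // 2

for M in (4, 16):
    fold, multi = folding_multipliers(M), multistage_multipliers(M)
    print(f"{M}x{M}: folding={fold}, multi-stage={multi}, ratio={multi / fold}")
```

The ratio is (M+1)/2, so the area advantage of folding grows linearly with the number of transmit antennas.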
As the array size increases, the number of operands in the Multiply-Accumulate
(MAC) operation in the metric calculation unit increases proportionally to the number
of antennas. Assuming a tree adder design, the critical path delay roughly increases
linearly with the number of transmit antennas. However, the time required to finish
the MAC operation should be scaled down by the number of antennas in order to
increase the throughput proportionally to the number of antennas. This timing
requirement for a fixed bandwidth is shown in Fig. 7 (b). The situation is actually
worse when metric enumeration is included in the loop. Since pipelining in the loop is
considered a difficult task, this architecture cannot operate at a high frequency even
for a 4×4 system [24].
To facilitate pipeline insertion, inputs are up-sampled by a factor m, and then one
register can be replaced with m pipeline registers in the loop using Noble Identity [42].
In this case, only one out of m samples is useful data, and the rest could be repeating
values of an original sample or padding zeros. By applying data-stream interleaving,
samples of other independent data streams can be introduced in the loop in place of
the repeated values or padding zeros. Technique of interleaving is therefore used to
improve area efficiency through logic sharing and to provide flexibility needed to
support varying number of data sub-carriers. In a multi-carrier communication system,
data streams are transmitted over narrow-band sub-carriers [28].
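The effect of placing m registers in a feedback loop and filling the idle slots with other streams can be seen in a toy model. The sketch below uses a generic first-order recursion as a stand-in for the decoder loop (an assumption made purely for illustration); the function names are hypothetical.

```python
def loop_with_delay(x, a, m):
    """Feedback loop y[n] = x[n] + a * y[n-m], i.e. m registers in the loop."""
    y = []
    for n, xn in enumerate(x):
        fb = y[n - m] if n >= m else 0
        y.append(xn + a * fb)
    return y

def interleave(streams):
    """Merge equal-length streams sample by sample: s0[0], s1[0], ..., s0[1], ..."""
    return [v for group in zip(*streams) for v in group]

def deinterleave(y, m):
    """Split an interleaved sequence back into m streams."""
    return [y[k::m] for k in range(m)]

# Three independent streams (for example, sub-carriers) share one loop with m = 3.
streams = [[1, 2, 3], [10, 20, 30], [5, 0, -5]]
shared = deinterleave(loop_with_delay(interleave(streams), 2, 3), 3)
separate = [loop_with_delay(s, 2, 1) for s in streams]
```

Here `shared` equals `separate`: each stream sees exactly its own single-register recursion, while the physical loop contains three registers that can be used for pipelining, and no cycle is spent on repeated or zero-padded samples.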
[Figure: (a) area versus antenna array size (4×4, 8×8, 16×16) for the multi-stage and folding architectures, showing the area reduction from folding; (b) delay versus antenna array size, showing the timing requirement, the critical path in the loop, and the growing timing gap in the folding architecture.]
Fig. 7. Design challenge and tradeoff for large antenna size. Impact of antenna array size on (a) area
and (b) critical path delay.
C. PE structure
The function of the PE is to find the $s_i$ that minimizes $|T_i|$ (where $T_i = b_i - R_{ii}s_i$) for each
level in the tree search, and to provide a candidate list sorted by increasing $|T_i|$,
since a path with smaller $|T_i|$ is more likely to be the ML estimate. A
straightforward algorithm mapping is to enumerate all possible constellation points and sort
the $T_i$ to find the $s_i$ and the candidate list [24]. The hardware cost and computational
latency of this architecture are very high for a large constellation size due to the circuit
parallelism and the inevitable latency of the sorting circuit. To resolve this problem, we
propose another strategy. First, the closest point is found through the geometric
relationship, since the $s_i$ with minimum $|T_i|$ corresponds to the point of $R_{ii}Q_i$ closest to $b_i$.
Second, the selected $s_i$ is used to calculate $T_i$. Finally, the candidate list
is generated by the constellation arrangement, as described in Section III-C-2, Fig. 12.
We decompose the PE into two parts: Metric Calculation Unit (MCU) and Metric
Enumeration Unit (MEU). Each submodule can be mapped to Area-Energy-Delay
space to explore optimal design parameters for the top-level integration.
Decomposing a design problem along these three axes provides insight into design
techniques and their impact on power, area, and throughput. Concurrency versus
latency is one of the basic tradeoffs that need to be considered. Maximizing data
throughput calls for a parallel architecture, which results in a large area. Conversely,
time-multiplexing improves area efficiency, but increases latency. For example, the
decoding algorithm operating on complex numbers can be transformed into a
real-valued problem, which results in a tree that is twice as deep as the original tree
with a smaller number of children per node [16]. Since the multipliers are reused, the
number of multipliers is reduced to one half at the cost of equal throughput reduction.
Flexibility is another issue in circuit design. Ideally, the circuit should be flexible
to support different search schemes (Depth-first or K-best). In general, the overhead
of flexibility results in reduction of both energy efficiency and area efficiency. This
overhead should be minimized while maintaining system performance. Fig. 8 shows
the circuit diagram of one PE. There are m-stage pipeline registers inserted in the loop,
so the critical path can be shortened under the timing constraint by choosing a larger
m. Since m data streams are interleaved into the PE, the hardware is always active,
achieving the same throughput as if the m pipeline registers had been inserted in a datapath
without a loop. The area overhead of the up-samplers for R can be removed if R is invariant
for each sub-carrier during one packet transmission. The flexibility of search scheme
is provided by the shift-register chain, which can be configured as forward trace or
backward trace. By placing K PEs onto one sphere decoder, K search paths are
explored at the same time to implement K-best algorithm, while each PE has
flexibility to trace back as Depth-first. The flexibility to support varying antenna size
is provided by the folding architecture. It reuses the same hardware with a higher
clock frequency as the antenna size increases. The details of sub-modules are
illustrated in the following.
[Figure: one PE, grouped into MCU and MEU. Blocks shown: shift-register chain, partial products, adder tree, m pipeline stages, symbol selection, subtractors, and |·|². Signals shown: R, R_ii, s, ŷ_i, b_i, T_i.]
Fig. 8. Circuit diagram of one PE.
1) Metric Calculation:
Metric Calculation Unit (MCU) computes $\sum_{j=i+1}^{M} R_{ij}s_j$. Basically, it executes a
Multiply-Accumulate (MAC) operation. To accumulate the maximal 16 operands and
achieve the highest throughput, there are 15 simplified complex + 1 simplified real
multipliers followed by an adder tree that merges the partial products. It is possible to
reduce the number of multipliers in a time-multiplexing manner at the price of lower
throughput [30]. For example, 4 complex multipliers can be time-multiplexed by 4 to
deploy 16 multipliers, with throughput reduced by 4. Such tuning at the architecture
level is used to position the design along throughput and power axis, with optimal
tuning of variables such as supply voltage.
Since the search process advances one stage per clock cycle, we propose an
FIR-like architecture to facilitate metric calculation, as shown in Fig. 8. If only
forward trace is allowed, the BER performance is limited by the number of parallel
processors, as in the K-best algorithm: even if more processing cycles are
provided, there is no room to improve the BER performance. Observing that the
trace-back goes up by only one or two layers rather than making a
random jump, we embed a bidirectional shift-register chain to adjust the search
depth. Since the search state is recorded in the shift registers, no extra memory, such
as stack memory, is needed to keep all the states [26] [40]. Because of the trace-back
requirement, a transposed-form FIR architecture is not suitable for reducing the critical
path; instead, the critical path is reduced by data interleaving.
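[Figure: FIR-like MCU datapath. Coefficients R_{i,i}, R_{i,i+1}, R_{i,i+2}, ..., R_{i,M} are paired with symbols s_i, s_{i+1}, s_{i+2}, ..., s_M, and the products feed an adder tree.]
Fig. 9. Circuit diagram of MCU.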
Coefficients of the R matrix are stored in memory in an area-efficient way. The
diagonal terms of the R matrix are real, while the rest are complex numbers. Exploiting
the upper-triangular structure, the real-part data (with the diagonal) and the
imaginary-part triangular data are organized into one square memory, which saves around
50% of area.
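One way to picture the square-memory organization is sketched below. The exact layout is not given in the text, so this is an assumption: real parts occupy the upper triangle (including the diagonal) and the imaginary parts of the off-diagonal entries are mirrored into the otherwise unused lower triangle. The names `pack_r` and `read_r` are hypothetical.

```python
def pack_r(R):
    """Pack an upper-triangular complex M x M matrix with a real diagonal
    into a single M x M array of real words (assumed layout)."""
    M = len(R)
    mem = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i, M):
            mem[i][j] = R[i][j].real      # real parts: upper triangle + diagonal
            if j > i:
                mem[j][i] = R[i][j].imag  # imaginary parts: mirrored below diagonal
    return mem


def read_r(mem, i, j):
    """Recover R[i][j] from the packed memory."""
    if j < i:
        return 0j                         # below the diagonal R is zero
    if j == i:
        return complex(mem[i][i], 0.0)    # diagonal is real
    return complex(mem[i][j], mem[j][i])
```

Under this assumed layout, M² real words are stored instead of the 2M² words needed for a full complex matrix, which is consistent with the roughly 50% saving quoted above.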
2) Metric Enumeration:
The Metric Enumeration Unit (MEU) enumerates the possible constellation points
according to their Partial Euclidean Distance (PED), $\sum_{j=M}^{i} |T_j|^2$, in ascending order.
Exhaustive search is a straightforward implementation; it calculates the PEDs of all
constellation points and uses a sorting circuit to find the minimum one, as shown in
Fig. 10 (a). The number of distance calculation units is proportional to the
constellation size (64 units are required for 64-QAM, for example). This requirement
in itself makes it hard to support a large constellation size, in addition to the extra
latency introduced by the minimum search circuit.
In the constellation plane, metric enumeration is equivalent to finding the scaled
constellation points $R_{ii}Q_i$ closest to $b_i$ and ordering them from the closest to
the farthest [2]. This is the underlying principle of the Schnorr-Euchner (SE) algorithm.
SE enumeration was originally applied to one-dimensional signals, such as real-valued PAM
or PSK constellations; it was therefore modified to arrange QAM constellations in $P_Q$
concentric groups to fit the original algorithm. For example, a 16-QAM constellation
can be expressed as an arrangement of points on 3 concentric circles. The
problem is then reformulated as finding the closest point in each subgroup and then the
closest point across subgroups, as shown in Fig. 10 (b) [24].
[Figure: three closest-point selection architectures. (a) Exhaustive search: b_i is subtracted from each scaled point R_ii Q_1, R_ii Q_2, ..., R_ii Q_k, followed by |·|² units and a min-search that outputs ŝ_i. (b) SE enumeration for QAM: PSK ALU 1, PSK ALU 2, ..., PSK ALU P_Q followed by a min-search that outputs ŝ_i. (c) Region partition search: region decision blocks operate on the real and imaginary parts of b_i using R_ii; the I/Q plane shows R_ii Q_i, b_i and the decision boundary.]
Fig. 10. Closest point selection scheme: (a) exhaustive search architecture, (b) SE enumeration for
QAM, (c) region partition based search approach. Real value is represented by gray line.
The original algorithm [2] uses the phase relationship to find the closest point in a
concentric circle. This approach is not suitable for hardware implementation, so [24]
proposed a decision-boundary-based method to simplify the SE enumeration. One
type of decision boundary is set by straight lines passing through the origin and the
midpoint between two adjacent constellation points in a concentric circle, to
specify the starting point. Another type of decision boundary is set by straight lines
passing through the origin and the midpoint between two constellation points
around the starting point in a concentric circle, to determine the initial search direction.
However, this simplification is only applicable to small constellations such as
16-QAM. Larger constellation sizes are hard to support for several reasons. First, the
decision boundary algorithm is quite complex: many multiplications are needed to
generate the decision boundaries. Second, the number of subgroups grows quickly,
which increases the latency of the min-search circuit. For example, 64-QAM is
decomposed into 9 subgroups. Third, the concentric group partition is not scalable as
the QAM constellation size changes, which makes it infeasible for the architecture to
support different modulations.
We propose a simple partition method based on Cartesian coordinates. The
constellation plane is partitioned into 64 regions for 64-QAM (8 regions along the real
axis and 8 along the imaginary axis). The closest point (with minimum distance)
can be decided by the location of bi/Rii, since the real part and the imaginary part can be
decoded separately, as shown in Fig. 10 (c). In fact, this idea is also applied to symbol
decision. For instance, to make a decision in a QPSK system, we do not need to
calculate the distances from the received signal to the 4 constellation points. Instead,
we just need to examine the signs of the real and imaginary parts.
The area complexity of the three architectures in Fig. 10 is evaluated using the
number of add-equivalent operators (add, subtract, compare) as the area estimate. For a
64-QAM constellation, exhaustive search needs 64 subtractors, 64 square operators,
and a min-search circuit. Assuming the square operators are simplified to absolute-value
operators with a small performance loss [24] and that min-search uses a serial
comparison circuit, a total of 192 add-equivalent operators are needed. SE enumeration for
64-QAM needs 64 boundary decision comparisons and a min-search across 9 subgroups,
so 73 add-equivalent operators are needed, assuming the boundaries are given. The
proposed region partition search needs 8 comparators for the real part and 8 for the
imaginary part, which is only 16 add-equivalent operators. Therefore, a 4.6× area
reduction is achieved compared to SE enumeration for 64-QAM and 12× compared to
exhaustive search.
A similar concept is applied to the delay comparison: the number of adder delays is
used as the delay estimation metric. Here, we assume the delay of the min-search circuit is
equal to $\log_2 n$, where n is the number of sorting elements. A serial
comparison circuit would need n adder delays to finish the comparison, so a more
area-consuming parallel architecture has to be used to reach this delay. The delay of
exhaustive search is approximated by the sum of the delays of 1 adder, 1 absolute-value
operator, and $\log_2 64$, which is equal to 8. The delay of SE enumeration is equal to
1 operator plus $\lceil \log_2 9 \rceil$, that is, 5. Our design needs only 1 comparator
delay, which is 1/5 the delay of SE enumeration without pipelining.
TABLE III
AREA AND DELAY COMPARISON
                     Exhaustive   SE enumeration   Our work
Area (normalized)    192          73               16
Delay (normalized)   8            5                1
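The entries of Table III follow from simple operator counting. The short script below reproduces that arithmetic for 64-QAM as a check; it is a sketch of the counting rules stated in the text (serial min-search for area, log2-depth min-search for delay), and the function names are hypothetical.

```python
import math


def exhaustive_cost(n_points):
    # area: one subtractor, one absolute-value unit and one serial
    # min-search comparison per constellation point
    area = 3 * n_points
    # delay: 1 adder + 1 absolute value + log2-depth min-search
    delay = 1 + 1 + math.ceil(math.log2(n_points))
    return area, delay


def se_enumeration_cost(n_points, n_subgroups):
    # area: one boundary comparison per point plus min-search over subgroups
    area = n_points + n_subgroups
    # delay: 1 operator + log2-depth min-search over the subgroups
    delay = 1 + math.ceil(math.log2(n_subgroups))
    return area, delay


def region_partition_cost(n_points):
    # area: comparators for the real part plus the same number for the
    # imaginary part (8 + 8 for 64-QAM, as counted in the text)
    per_axis = math.isqrt(n_points)
    return 2 * per_axis, 1


if __name__ == "__main__":
    for name, cost in [("exhaustive", exhaustive_cost(64)),
                       ("SE enumeration", se_enumeration_cost(64, 9)),
                       ("region partition", region_partition_cost(64))]:
        print(name, cost)
```

For 64-QAM with 9 subgroups this gives (192, 8), (73, 5) and (16, 1), so the area ratios relative to the region partition search are 192/16 = 12 and 73/16 ≈ 4.6.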
One challenge in the MEU implementation is that a divider or an inverse operator
seems inevitable to calculate $b_i/R_{ii}$, which usually introduces a longer latency and
higher hardware complexity. The property that the diagonal element $R_{ii}$ of the R matrix
is real simplifies the problem, but still introduces hardware overhead. One possible method
is to calculate $R_{ii}^{-1}$ in the preprocessing stage, since these values are updated at
the packet rate [16] [21]. If $R_{ii}^{-1}$ is given, only one multiplier is needed. In our
approach, we show that this inverse operation is not necessary.
Instead of deciding $b_i R_{ii}^{-1}$ in the constellation plane, it is equivalent to deciding $b_i$
in a constellation plane scaled by $R_{ii}$. The decision boundary (db) is denoted as
$db \in \{-6, -4, -2, 0, 2, 4, 6\}$, so the $b_i R_{ii}^{-1}$ calculation is replaced by a
$db \cdot R_{ii}$ calculation. It may
seem that replacing one multiplier with 6 multipliers in order to execute the boundary
comparison in parallel is not a good tradeoff from the area standpoint.
However, a careful examination reveals that a large multiplier is replaced with small
multipliers, and that these small multipliers can be simplified to shift-add operators.
Therefore, only one adder is needed to implement $db \cdot R_{ii}$ ($6R_{ii} = 4R_{ii} + 2R_{ii}$);
the others can be implemented by hard-wired shifting and inversion. The negative values
can be computed by bit inversion without the carry-in bit, because the carry bit appears as
a negligible quantization error from the signal decision perspective. The area reduction
is quite high. If the wordlength of $R_{ii}^{-1}$ is L, then the multiply operation with a large
wordlength $[L \times WL(b_i)]$ is replaced with an add operation, which also has a smaller
number of bits $[L + 3]$. The simple region decision circuit is shown in Fig. 11.
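[Figure: region decision circuit. Real{b_i} or Imag{b_i} is compared against the thresholds 6R_ii, 4R_ii, 2R_ii, 0, −2R_ii, −4R_ii, −6R_ii; the comparator outputs select among the labeled levels 7, 5, 3, 1, −1, −3, −5; a symbol remapping block, controlled by the constellation size and the sign of R_ii, produces s[2] s[1] s[0].]
Fig. 11. Region decision circuit.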
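The division-free decision can be modeled in a few lines. The sketch below works on one dimension (real or imaginary part) of a 64-QAM symbol with integer fixed-point values; the names `thresholds`, `region_decide` and `decide_symbol` are hypothetical. One deliberate simplification: negative thresholds are formed by exact negation here, whereas the text describes an approximate bit inversion without carry-in.

```python
def thresholds(rii):
    """Scaled decision boundaries db * rii for db in {-6, ..., 6},
    built from shifts and a single addition."""
    r2 = rii << 1          # 2 * rii: hard-wired shift
    r4 = rii << 2          # 4 * rii: hard-wired shift
    r6 = r4 + r2           # 6 * rii: the only adder
    return [-r6, -r4, -r2, 0, r2, r4, r6]


def region_decide(b, rii):
    """Return the level in {-7, -5, ..., 7} closest to b / rii without dividing."""
    th = thresholds(rii)
    if rii >= 0:
        count = sum(b > t for t in th)   # boundaries lying below b
    else:
        count = sum(b < t for t in th)   # a negative rii flips every comparison
    return 2 * count - 7


def decide_symbol(b_re, b_im, rii):
    """Real and imaginary parts are decided independently."""
    return region_decide(b_re, rii), region_decide(b_im, rii)
```

For example, `decide_symbol(11, -23, 5)` compares 11 and -23 against multiples of 5 and returns the levels 3 and -5, the same answer a brute-force minimum-distance search over all eight levels per axis would give.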
An extra symbol remapping block is inserted at the end to remap constellation
points when a different constellation size is used. Decision outputs are mapped to Gray
code directly, without an extra 2's complement representation and Gray code
transformation. Table IV shows the mapping rules. Although Rii can always be chosen
positive to simplify this circuit further, we keep the flexibility of supporting negative
values as well in order to relax the QR decomposition processing. With the proposed
approach, no sorting is needed and it is easy to expand to a large constellation size.
Additionally, the use of bit-level arithmetic results in only a linear complexity increase
as the constellation size grows exponentially.
TABLE IV
SYMBOL REMAPPING AND DECISION
64-QAM 16-QAM QPSK/BPSK
s[1:0]
00 00 00
11
00
10 01 01 01 01
11 11 11 11
10 10 10 10 10
real imag
s[2] s[1] s[0] s[2] s[1] s[0]
64-QAM (6-bit)
16-QAM (4-bit)
QPSK (2-bit)
BPSK (1-bit)
After finding the closest point, the remaining candidates are also decided by the
distance between bi and the constellation points in ascending order. The decoded
symbol si is used to enumerate the remaining candidates through geometric relationships
rather than sorting, in either trace-back or parallel search mode. The complexity of the
sphere decoding algorithm is independent of the lattice constellation size [48];
therefore, we can enumerate the adjacent constellation points instead of the
whole constellation plane. We extract 9 points in the constellation plane as illustrated
in Fig. 12. The eight surrounding constellation points have either a 1-bit error
(Fig. 12 (a-b)) or 2-bit errors (Fig. 12 (c-d)) if Gray coding is used. The 2nd closest
point for each solution set is decided based on the decision boundaries indicated by the
dashed lines in Fig. 12 (a), (c). The remaining points are decided by the search direction,
which is specified by other decision boundaries, starting from the 2nd point, as shown in
Fig. 12 (b), (d). These two decision boundaries are easy to calculate by sign check and
comparison for {Re} and {Im}. The search sequence within each group is well-defined,
but the boundary between the two groups is not easy to calculate. For example,
which of the 3rd search points in the two groups, Fig. 12 (b) and (d), is closer
cannot be decided by a simple boundary. Therefore, we adopt a mixed method: the two
solution sets are compared to find the final enumeration sequence with respect to the
central point.
[Figure: four panels, each showing the 3×3 neighbourhood of 64-QAM points with Gray labels 110110, 110111, 110101, 111101, 111111, 111110, 101110, 101111, 101101 around R_ii s_i, with b_i in the central region (axes: real part, imag. part). Panels (a), (b): 1-bit-error subset; panels (c), (d): 2-bit-error subset. Panels (a), (c) mark the 2nd closest point; panels (b), (d) mark the 3rd to 5th points. Search order is labeled #1 to #5.]
Fig. 12. Candidate enumeration scheme. Decision boundaries are dashed lines in the central region.
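As a reference for what the enumeration has to produce, the ordering of the 3×3 neighbourhood can be written down by brute force. The sketch below is not the boundary-based circuit described above: it sorts the neighbours by distance, which is exactly what the hardware avoids, and is only meant as a golden model to compare against. The function name and the `max_level` parameter (7 for 64-QAM) are hypothetical.

```python
def enumerate_neighbours(b, rii, center, max_level=7):
    """Return the closest point and its valid neighbours, nearest first.

    b      : complex received value after interference cancellation
    rii    : real diagonal element scaling the constellation
    center : complex closest constellation point (odd-integer grid)
    """
    candidates = []
    for d_re in (-2, 0, 2):
        for d_im in (-2, 0, 2):
            q = complex(center.real + d_re, center.imag + d_im)
            # drop neighbours that fall outside the constellation
            if abs(q.real) <= max_level and abs(q.imag) <= max_level:
                candidates.append(q)
    # reference ordering by Euclidean distance to b
    return sorted(candidates, key=lambda q: abs(b - rii * q))
```

For an interior point the list has 9 entries, starting with `center`; at a corner of the constellation only 4 candidates remain.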
Fig. 13 shows the overall area reduction for one PE. An overall 20× area reduction
is achieved through various signal processing and circuit techniques, from the arithmetic
level down to the circuit level. If 16 sub-carriers are processed through data-stream
interleaving, the equivalent area reduction would be more than 260 times. So far, we
have built a one-PE sphere decoder. To speed up the search and improve the error
probability, multiple PEs need to be utilized to span the search range. A multi-core
architecture is proposed to coordinate all the functional blocks in a power- and
area-efficient way.
[Figure: bar chart of PE area after each technique. Steps shown: initial, folding, simplified multiplier, memory reduction, wordlength reduction, MEU simplification. Annotations: ×8.5, 20%, 5%, 20%, 30%; total 20× reduction.]
Fig. 13. Summary of area reduction for one PE.
D. Multi-Core Architecture
A multiple-PE architecture inherently improves the search speed in proportion to the number
of processing elements. Beyond that, the search speed increases further because shorter
paths can be found earlier, which prunes the tree more efficiently. In addition, the
number of processing elements offers the flexibility to trade performance for area.
Virtually all K-best architectures use parallelism to search several branches at the
same time [16]-[22], but they do not take advantage of two important features of
sphere decoding: radius shrinking and tree pruning.
When search paths run outside the search radius, they should be discarded
instead of continuing with a deeper search. Intuitively, we should assign a new search
branch within the search radius to the processors whose search paths fall outside the
search radius. To maximize the probability of finding the ML estimate, the children of
the branch with the smaller Euclidean distance for that level are assigned as the new
search candidates. Therefore, the functions needed include: (1) a sorting circuit to
record the branch with minimum Euclidean distance, (2) a radius checking block to
examine whether the Euclidean distance is larger than the search radius, and (3) the
candidate enumeration circuit illustrated in Fig. 12. Since the radius checking block is
included in the sphere decoder, one of the many algorithms for effective radius shrinking
can be utilized [2] [3] [10] [12].
1) Sorting Circuit:
Sorting algorithms are extensively studied in computer science. In hardware,
several architectures are widely used: serial sorting, parallel sorting (Batcher sorter),
and Single Instruction Multiple Data (SIMD) architecture [16][33-36]. A serial sorter
executes the bubble sort algorithm [16]; its serial comparisons result in a
longer latency. A parallel sorter, widely used in packet-switching networks,
makes use of parallelism to speed up sorting at the cost of increased area. SIMD
provides the largest flexibility, but its interconnect network is very complex. A
comparison of these architectures is summarized in Table V, where $n = \log_2 N$ for N inputs.
Latency and area are estimated as the number of comparator delays and the number
of comparators, respectively.
TABLE V
SUMMARY OF SORTING CIRCUITS
                     Serial   Parallel       SIMD
Latency              N        n(n+1)/2       n(n+1)/2
Area                 N/2      (n^2+n)N/4     N/2
Routing complexity   Low      Medium         High
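To make the expressions in Table V concrete, the snippet below evaluates them for a given number of inputs. It simply plugs numbers into the table formulas (comparator delays for latency, comparator count for area); the function name `sorter_costs` is hypothetical and N is assumed to be a power of two.

```python
def sorter_costs(N):
    """Evaluate the Table V latency/area expressions for N inputs."""
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two")
    n = N.bit_length() - 1            # n = log2(N)
    batcher_depth = n * (n + 1) // 2
    return {
        "serial":   {"latency": N,             "area": N // 2},
        "parallel": {"latency": batcher_depth, "area": (n * n + n) * N // 4},
        "simd":     {"latency": batcher_depth, "area": N // 2},
    }
```

For N = 16 (n = 4), the serial sorter costs 16 comparator delays and 8 comparators, while the parallel sorter needs 10 delays and 80 comparators: ten times the area for a 1.6× shorter latency.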
Area is the first priority in the design of the sorting circuit, because the sorting
circuit needs to be replicated to support multiple sub-carriers. Thanks to the
data-interleaving operation, N − 1 time slots are available while the other sub-carriers
are processed, which makes serial comparison possible within a symbol period. Therefore,
a serial sorter is selected in our design. Since the first input is loaded into the
register of the first stage, the latency is N − 1 cycles (one cycle saved). Fig. 14 shows
the circuit of the serial sorter. For each comparator, the larger operand is sent to the
lower branch and the smaller one is sent to the upper branch. The final sorted Euclidean
distances from each PE can be used by the outer receiver for iterative decoding.
[Figure: chain of compare stages (stage 1, stage 2, ..., stage M/2), each with H and L outputs.]
Fig. 14. Circuit diagram of a serial sorter.
2) Radius Checking:
Radius checking is executed in parallel with sorting. The Euclidean distances stored in
all PEs are checked serially. If the Euclidean distance is larger than the search radius, a
new search path is assigned. On the other hand, if the Euclidean distance is smaller
than the search radius, then the search radius is updated to this smaller value and the
corresponding branch is chosen as the ML estimate.
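The serial check can be sketched as a loop over the PEs. Two points in this sketch are assumptions rather than statements from the text: the radius is only updated when a PE has reached a complete path (flagged by the hypothetical `at_leaf` input), and the function merely reports which PEs need a new path instead of performing the reassignment.

```python
def radius_check(distances, at_leaf, radius):
    """Serially check each PE's Euclidean distance against the search radius.

    distances : accumulated Euclidean distance held by each PE
    at_leaf   : True where the PE holds a complete candidate (assumption)
    radius    : current search radius
    Returns (new_radius, best_pe, pes_to_reassign).
    """
    best_pe = None
    reassign = []
    for pe, (dist, leaf) in enumerate(zip(distances, at_leaf)):
        if dist > radius:
            reassign.append(pe)   # path left the sphere: give this PE a new branch
        elif leaf:
            radius = dist         # shrink the radius to the better full path
            best_pe = pe          # current best (ML) candidate
    return radius, best_pe, reassign
```

Because the radius shrinks while the PEs are scanned, a path that was inside the sphere at the start of the scan can still be discarded later in the same pass, which is the pruning benefit described above.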
A multi-core architecture is proposed to coordinate all functional blocks. The
number of PEs is decided from the BER-area-power tradeoff. A 16-PE architecture is
shown in Fig. 15. For each PE, the decoded symbols and the associated Euclidean
distance for 16 sub-carriers are fed into registers serially after processing. In each
cycle, only the metrics of one sub-carrier are computed, while the other sub-carriers
undergo sorting, radius checking, and candidate enumeration across PEs. A sorting
circuit connects the 16 registers belonging to the same sub-carrier. Radius
checking is conducted serially using a multiplexer and is followed, conditionally, by a
new path assignment.
[Figure: 16-PE sphere decoder. Each PE contains an MCU, an MEU and memory, with a demux feeding 16 sub-carrier registers (SC-1 to SC-16). PE-1 to PE-16 surround a shared radius checking and updating block with a mux and an I/O interface. Axis label: sub-carrier space.]
Fig. 15. Multi-PE sphere decoder architecture.
With this compact multi-PE architecture, the sphere decoder provides very high
performance. At 256 MHz, each PE provides 46.5 GOPS (12-bit equivalent add), and the
total for the 16-PE architecture amounts to 800 GOPS (including the sorting and
radius checking circuits) for the whole system when operating in the 16×16 64-QAM
mode. In addition to high performance, flexibility and scalability are also included.
We illustrate the design specifications next.
E. Design Specifications
The sphere decoder is designed to support different system configurations with
respect to antenna array size, modulation and detection schemes, as well as the
number of sub-carriers. Table VI summarizes the configuration modes. Since varying
antenna array sizes and modulations are supported, this design is also capable of trading
off diversity gain for spatial multiplexing if STBC is used. Due to interleaving by 16,
the supported number of sub-carriers can be any multiple of 16 through data
rearrangement.
TABLE VI
OVERVIEW OF SYSTEM CONFIGURATION MODES
                     Configuration Modes
Antenna array size   Any square array from 2×2 to 16×16
Modulation           BPSK, QPSK, 16-QAM, 64-QAM
# sub-carriers       16, 32, 64, 128
Detection            Depth-first, K-best
The main design specification is the throughput constraint for the algorithm. Since a
total bandwidth of 16 MHz is used, each sub-channel requires 1 MS/s to process the data
in the case of 16 sub-carriers. The requirement is thus to process 16 parallel streams
of data at a 1 MHz rate. The clock specification for the resulting architecture then becomes
256 MHz (1 MHz × 16 sub-carriers × 16 antennas). Note that all modes can run at a clock
frequency of 256 MHz; the clock frequency required for a smaller array size is lower
because the channel bandwidth is fixed. Detailed system specifications are listed in
Table VII for array sizes 4×4 to 16×16. The system supports an ideal throughput of up to
1.536 Gbps, which corresponds to a spectral efficiency of up to 96 bps/Hz. When the system
is operated in a smaller array mode, clock frequency and supply voltage can be reduced
to minimize power consumption.
TABLE VII
SUMMARY OF SYSTEM SPECIFICATION
Antenna array                  4×4                      8×8                      16×16
Modulation                     QPSK   16-QAM  64-QAM    QPSK   16-QAM  64-QAM    QPSK   16-QAM  64-QAM
BW (baseline)                  16 MHz (all modes)
Clock freq.                    64 MHz                   128 MHz                  256 MHz
Throughput (bps)               128M   256M    384M      256M   512M    768M      512M   1.024G  1.536G
Spectral efficiency (bps/Hz)   8      16      24        16     32      48        32     64      96
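The rows of Table VII can be reproduced from the stated bandwidth, interleaving factor and modulation order. The helper below is a sketch of that arithmetic for the ideal (uncoded) case; the function and dictionary names are hypothetical.

```python
BITS_PER_SYMBOL = {"BPSK": 1, "QPSK": 2, "16-QAM": 4, "64-QAM": 6}


def system_spec(n_antennas, modulation, bandwidth_hz=16e6, n_subcarriers=16):
    """Return (clock_hz, throughput_bps, spectral_efficiency_bps_per_hz)."""
    symbol_rate = bandwidth_hz / n_subcarriers           # 1 MS/s per sub-carrier
    # folding: one tree level per cycle, interleaved over the sub-carriers
    clock_hz = symbol_rate * n_subcarriers * n_antennas
    throughput = bandwidth_hz * n_antennas * BITS_PER_SYMBOL[modulation]
    return clock_hz, throughput, throughput / bandwidth_hz


if __name__ == "__main__":
    for n in (4, 8, 16):
        for mod in ("QPSK", "16-QAM", "64-QAM"):
            print(f"{n}x{n}", mod, system_spec(n, mod))
```

For the 16×16 64-QAM mode this gives a 256 MHz clock, 1.536 Gbps and 96 bps/Hz; for 4×4 QPSK it gives 64 MHz, 128 Mbps and 8 bps/Hz.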
A hardware comparison is given in Table VIII. The estimated chip area is 0.55
mm² in a standard 90 nm CMOS process, using the approximation that 10,000 FPGA
slices ≈ 1 mm² of layout area in 90 nm CMOS [28]. To make a fair comparison, the
area is normalized by the number of transmit antennas (this is a conservative estimate,
because the hardware complexity could grow quadratically with the number of
transmit antennas). The data indicate that the proposed architecture is the most area
efficient compared to prior work. Furthermore, our design outperforms all previously
published designs in terms of supported antenna array size and constellation size, as
shown in Fig. 16. Unlike previous work, the proposed architecture also supports
multiple sub-carriers and search methods. Finally, this is the first design that offers
the flexibility required to fully traverse the diversity vs. spatial-multiplexing tradeoff
curve.
TABLE VIII
HARDWARE COMPLEXITY COMPARISON
               [19]      [17]               [21]          [22]     [24]     This work
Area           500k GC   10 mm² (0.18 µm)   12.7 slices   97k GC   50k GC   *
Area (norm.)   6.5       2.5                9.2           18.2     1.3      1
* 154k gate count (GC), 0.55 mm² (90 nm), or 5.5 slices
[Figure: supported antenna array size (4x4, 8x8, 16x16) versus modulation (BPSK, QPSK, 16QAM, 64QAM) for this work, for [17][21][22][24], and for [19].]
Fig. 16. Comparison of this work with previous work.
IV. DESIGN METHODOLOGY
An integrated design methodology is adopted in our work to incorporate algorithm,
architecture, and circuit implementation in a highly automated environment. Since the
design is complex, we start with a layered design approach that decomposes the
whole system hierarchically, from the top-level architecture down to the fundamental
modules. Different considerations such as area and throughput are evaluated at
each layer for architecture optimization. A graphical Simulink/Matlab development
environment offers bit-true, cycle-accurate, hardware-equivalent modules for
simulation, which are then translated to FPGA emulation without hardware description
language (HDL) coding. Due to the limited capacity of a single FPGA, the BEE2 platform
[38] is used to accommodate the whole system and speed up emulation.
A. Simulink-Based Design Environment
We use Simulink/Matlab design environment [44]. Traditionally, circuit design for
communication signal processing is divided into two stages: algorithm design and
circuit implementation. Algorithm designers use C/C++ or Matlab for system
simulation, and then the designed architecture is implemented by circuit designers
using HDL. There are usually several iterations between two design stages to ensure
the final design satisfies the specifications. In this work, Xilinx System Generator
(XSG) block-sets are used to build hardware equivalent modules, which leverages
cycle-accurate software simulation. In addition, quantization effects due to finite
wordlength are considered in the simulation. Area information is extracted by
resource estimator (XSG) or design compiler (Synopsys) in terms of number of slices
or area in the early design stage, since the equivalent HDL can be generated
automatically. The drawback of the Simulink-based design flow is its lengthy simulation
time, which can be mitigated by FPGA-based hardware emulation [43].
B. Emulation Platform
FPGA-based hardware emulation and rapid prototyping have become an attractive
solution, providing up to $10^6$ times faster simulation speed than software
simulation [37]. A Xilinx University Program (XUP) board (Virtex-2 Pro 30 part) [50] is
used to develop the hardware/software co-simulation environment for small circuits. In
this case, the hardware modules built in Simulink are replaced with the configured
FPGA to speed up simulation. Due to the limited capacity of the XUP board, the BEE2
platform is used for whole-system emulation.
The BEE2 consists of 5 Virtex-2 Pro 70 FPGAs (~10M equivalent logic gates
total). Each FPGA embeds a PowerPC core, which minimizes the latency between the
microprocessor and the reconfigurable logic. Four FPGAs (user FPGAs) are used for
computation and one for control (control FPGA), as shown in Fig. 17. With
high-bandwidth, low-latency links, BEE2 provides a virtual single FPGA of five times the
capacity [38].
[Figure: four user FPGAs (User FPGA-1 to User FPGA-4) connected to one control FPGA (Ctrl FPGA).]
Fig. 17. BEE2 emulation platform.
C. Simulation Results
The BER performance of one PE is verified through the hardware/software
co-simulation environment. In this preliminary experiment, ZF-DFE/BLAST
algorithm is adopted, i.e. for each level of the search tree topology, only the closest
lattice point is chosen as the decoded symbol [41] [49]. Since only a small portion of
the solution space is examined, there exists a performance gap between this scheme
and ML solution. However, we demonstrate a system with a larger antenna array and
repetition coding can outstrip the ML performance with a smaller antenna array easily.
The BER performance can be further improved to achieve ML performance without
repetition coding by using multiple PEs, which is being designed.
Fig. 18 (a) shows the BER performance of 64-QAM modulation for different
antenna array sizes and different repetition coding rates. Repetition coding here
refers to sending replicas in the space domain to reduce the error probability. We see
that the performance of 4×4, 8×8, and 16×16 is comparable, but the throughput is different
given a fixed bandwidth. With repetition coding by a factor of 2, the throughput
drops to one half, but the BER performance is improved. Therefore, the throughput of
4×4, 8×8 with repetition coding by 2, and 16×16 with repetition coding by 4 is the
same, but the BER performance is improved significantly. The performance of the ML
estimate is depicted in Fig. 18 (b) as a reference, which is the information theoretic
bound for a 4×4 system. An 8×8 system with repetition coding by 2 outperforms
the 4×4 system with ML performance by 5 dB.
[Figure: BER versus Eb/No (0 to 25 dB; BER axis from 10^0 down to 10^-6). (a) Curves: 16x16 64QAM; 8x8 64QAM; 4x4 64QAM; 8x8 64QAM, rep. x2; 16x16 64QAM, rep. x4. (b) Curves: 4x4 16QAM; 4x4 16QAM ML; 8x8 16QAM, rep. x2; 16x16 16QAM, rep. x4.]
Fig. 18. Selected BER performance of one PE.
V. CONCLUSION
This work proposed a flexible sphere decoder architecture for extracting diversity
and spatial multiplexing gains for MIMO communications. The choice of architecture
is evaluated in terms of hardware complexity, throughput requirement, and flexibility.
To keep the sphere decoder with large dimensional signal processing feasible, design
tradeoffs are jointly considered at algorithm, architecture and circuit layers.
Several signal processing and circuit techniques are used to reduce hardware
complexity while keeping high throughput requirement. We start from multiplier
simplification by numerical strength reduction and numerical representation
manipulation. A folding and interleaved architecture is used to reduce area and break
down critical path simultaneously. Also, hardware reuse provides flexibility to support
varying antenna array. By taking advantage of bi-directional shift register chain, huge
memory usage to store search states is saved and backward trace is offered. Metric
enumeration is implemented using an efficient boundary decision algorithm and
associated simplified circuit, which supports multiple modulations with a negligible
area cost. Finally, a multi-core architecture is proposed to coordinate all processing
elements. This unique architecture interleaves multiple sub-carrier data streams and
searches possible paths in parallel for each sub-carrier. Meanwhile, processing
intervals are used for radius check and update serially, creating a very compact
realization for sphere decoding algorithm.
An overall 20 times area reduction is achieved for each processing element. The
proposed flexible architecture supports antenna arrays ranging from 2×2 to 16×16,
modulations from BPSK to 64-QAM, and 16 to 128 sub-carriers. With an 800 GOPS
processing capability, the peak estimated data rate exceeds 1 Gbps over a 20 MHz
channel.
VI. FUTURE WORK
The flexible architecture is designed to implement various search algorithms, but
the optimal algorithm to fully utilize all PEs is still being developed. Thanks to BEE2
platform enabling rapid prototyping to verify the proposed algorithms, the simulation
time can be shortened by several orders of magnitude. Soft information generation is
another important function which can be combined into the proposed sphere decoder
[31]. The ultimate goal of this work is to develop an ASIC which is the optimal
architecture in the energy-area-delay space. A sensitivity-based optimization
framework will be adopted to balance design variables for blocks in the same
abstraction layer of the hierarchy as well as across layers [29]. Starting from the initial
realization, the design can be further optimized through wordlength optimization, gate
sizing, and VDD optimization. Finally, the fabricated chip will be tested using an
efficient FPGA-based ASIC verification flow [6]. The proposed timeline with
accomplished and remaining work is plotted in Fig. 19.
[Figure: Gantt chart with time axis 09/06, 01/07, 05/07, 09/07, 01/08, 05/08, 09/08, 01/09, 05/09. Tasks: XUP implementation, BEE2 emulation, algorithm space exploration, architecture space exploration, ASIC implementation, chip testing platform, chip verification, thesis writing. A marker is labeled 11/13.]
Fig. 19. Proposed timeline.
REFERENCES
[1] E. Zimmermann, W. Rave, and G. Fettweis, “On the Complexity Decoding,” in Proc.
International Symposium on Wireless Personal Communication (WPMC’04), vol. 2, pp.
500–504, Sept. 2004.
[2] B. M. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna
channel,” IEEE Trans. Comm., vol. 51, pp. 389–399, Mar. 2003.
[3] G. B. Giannakis, Z. Liu, X. Ma, and S. Zhou, Space-Time Coding for Broadband
Wireless Communications, John Wiley & Sons, 2007.
[4] D. Tse and P. Viswanath, Fundamentals of Wireless Communication, Cambridge
University Press, 2005.
[5] L. Zheng and D. Tse, “Diversity and Multiplexing: A Fundamental Tradeoff in
Multiple-Antenna Channels,” IEEE Transactions on Information Theory, vol. 49, no. 5,
pp. 1073–1096, May 2003.
[6] D. Markovic, A Power/Area Optimal Approach to VLSI Signal Processing, PhD Thesis,
University of California at Berkeley, 2006.
[7] F. Edman, Digital hardware aspects of multiantenna algorithms, PhD Thesis, Lund
University, 2006.
[8] A. Maltsev, V. Pestretsov, R. Maslennikov, and A. Khoryaev, “Triangular systolic array
with reduced latency for QR-decomposition of complex matrices,” in Proc. IEEE
International Symposium on Circuits and Systems (ISCAS’06), pp. 385–388, May 2006.
[9] L. M. Davis, “Scaled and decoupled Cholesky and QR decompositions with application
to spherical MIMO detection,” in IEEE Wireless Communications and Networking
(WCNC’03), vol. 1, pp. 326–331, Mar. 2003.
[10] J. Lee, S. Park, Y. Zhang, K. K. Parhi, and S.-C. Park, “Implementation Issues of A List
Sphere Decoder,” in Proc. International Conference on Acoustics, Speech and Signal
Processing (ICASSP’06), vol. 3, pp. 996–999, May 2006.
[11] M. K. Varanasi, “Decision Feedback Multiuser Detection: A Systematic Approach,”
IEEE Transactions on Information Theory, vol. 45, no. 1, pp. 219–240, Jan. 1999.
[12] W. Zhao and G. B. Giannakis, “Sphere decoding algorithms with improved radius
search,” IEEE Transactions on Communications, vol. 53, no. 7, pp. 1104–1109, July
2005.
[13] V. Tarokh, N. Seshadri, and A. R. Calderbank, “Space-Time Codes for High Data Rate
Wireless Communication: Performance Criterion and Code Construction,” IEEE
Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, Mar. 1998.
[14] V. Tarokh, H. Jafarkhani, and A. Calderbank, “Space-Time Block Code from
Orthogonal Designs,” IEEE Transactions on Information Theory, vol. 45, no. 5, pp.
1456–1467, July 1999.
[15] D. Gesbert, M. Shafi, D.-S. Shiu, P. J. Smith, and A. Naguib, “From Theory to
Practice: An Overview of MIMO Space-Time Coded Wireless Systems,” IEEE Journal
on Selected Areas in Communications, vol. 21, no. 3, pp. 281–302, Apr. 2003.
[16] K.-W. Wong, C.-Y. Tsui, R. S.-K. Cheng, and W.-H. Mow, “A VLSI architecture of a
K-best lattice decoding algorithm for MIMO channels,” in Proc. IEEE International
Symposium on Circuits and Systems (ISCAS’02), vol. 3, pp. 273–276, May 2002.
[17] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge, “Silicon complexity for
maximum likelihood MIMO detection using spherical decoding,” IEEE Journal of
Solid-State Circuits, vol. 39, pp. 1544–1552, Sep. 2004.
[18] B. Widdup, G. Woodward, and G. Knagge, “A Highly-Parallel VLSI Architecture for a
List Sphere Detector,” in Proc. Int. Conf. Communications, vol. 5, pp. 2720–2725, June
2004.
[19] G. Knagge, M. Bickerstaff, B. Ninness, S. R. Weller, and G. Woodward, “A VLSI 8x8
MIMO Decoder Engine,” in IEEE Workshop on Signal Processing Systems (SIPS’06), pp.
387–392, Oct. 2006.
[20] G. Knagge, G. Woodward, S. Weller, and B. Ninness, “An Optimised Parallel Tree
Search for Multiuser Detection with VLSI Implementation Strategy,” in Global
Telecommunications Conference (GLOBECOM’04), pp. 2440–2444, Dec. 2004.
[21] L. G. Barbero and J. S. Thompson, “Rapid Prototyping of a Fixed-Throughput Sphere
Decoder for MIMO Systems,” in Proc. Int. Conf. Communications, vol. 7, pp. 3082–3087,
June 2006.
[22] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for
MIMO detection,” IEEE Journal on Selected Areas in Communications, vol. 24, pp.
491–503, Mar. 2006.
[23] K. K. Parhi, VLSI Digital Signal Processing Systems, John Wiley & Sons, 1999.
[24] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, “VLSI
implementation of MIMO detection using the sphere decoding algorithm,” IEEE
Journal of Solid-State Circuits, vol. 40, pp. 1566–1577, July 2005.
[25] L. G. Barbero and J. S. Thompson, “Performance of the complex sphere decoder in
spatially correlated MIMO channels,” IET Communications, vol. 1, pp. 122–130, Feb.
2007.
[26] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press,
1998.
[27] J. Anderson and S. Mohan, “Sequential coding algorithms: a survey and cost analysis,”
IEEE Transactions on Communications, vol. COM-32, pp. 169–176, Feb. 1984.
[28] D. Markovic, B. Nikolic, and R. W. Brodersen, “Power and Area Efficient VLSI
Architecture for Communication Signal Processing,” in Proc. Int. Conf.
Communications, Jun. 2006.
[29] D. Markovic, B. Nikolic, and R. W. Brodersen, “Power and Area Minimization for
Multidimensional Signal Processing,” IEEE Journal of Solid-State Circuits, vol. 42, no. 4,
pp. 922–934, Apr. 2007.
[30] C. J. Nicol, P. Larsson, K. Azadet, and J. H. O’Neill, “A Low-Power 128-Tap Digital
Adaptive Equalizer for Broadband Modems,” IEEE Journal of Solid-State Circuits, vol.
32, no. 11, pp. 1777–1789, Nov. 1997.
[31] C. Studer, M. Wenk, A. Burg, and H. Bölcskei, “Soft-Output Sphere Decoding:
Performance and Implementation Aspects,” in Proc. Asilomar Conference on Signals,
Systems and Computers (ACSSC’06), pp. 2071–2076, Oct. 2006.
[32] A. Wiese, X. Mestre, A. Pages, and J. R. Fonollosa, “Efficient Implementation of Sphere
Demodulation,” in IEEE Workshop on Signal Processing Advances in Wireless
Communications, pp. 36–40, June 2003.
[33] J. B. Anderson and S. Mohan, “Sequential coding algorithms: A survey and cost
analysis,” IEEE Transactions on Communications, vol. COM-32, pp. 169–176, Feb.
1984.
[34] N. K. Sharma, “Modular Design of a large sorting network,” in Third International
Symposium on Parallel Architectures, Algorithms, and Networks, pp. 362–382, Dec.
1997.
[35] P. A. Bengough and S. J. Simmons, “Sorting-based VLSI architecture for the
M-algorithm and T-algorithm trellis decoders,” IEEE Transactions on Communications,
vol. 43, no. 2/3/4, pp. 514–522, 1995.
[36] S. Mohan and A. Sood, “A Multiprocessor Architecture for the (M, L) algorithm suitable
for VLSI implementation,” IEEE Transactions on Communications, vol. COM-34, pp.
1219–1224, Dec. 1986.
[37] C. Chang, K. Kuusilinna, B. Richards, A. Chen, N. Chan, R. W. Brodersen, and B.
Nikolic, “Rapid design and analysis of communication systems using the BEE hardware
emulation environment,” in Proc. IEEE Rapid System Prototyping Workshop, Jun. 2003.
[38] BEE2: Berkeley Emulation Engine 2, [Online]. Available:
http://bwrc.eecs.berkeley.edu/Research/BEE/BEE2/index.htm
[39] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design
Perspective, 2nd ed. Prentice-Hall, 2003.
[40] F. Jelinek, “Fast Sequential Decoding Algorithm Using a Stack,” IBM Journal of
Research and Development, vol. 13, no. 6, pp. 675–685, Nov. 1969.
[41] W.-J. Choi, R. Negi, and J. M. Cioffi, “Combined ML and DFE Decoding for the
V-BLAST System,” in Proc. Int. Conf. Communications, vol. 3, pp. 1243–1248, June
2000.
[42] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, 1993.
[43] H. K.-H. So, A. Tkachenko, and R. Brodersen, “A Unified Hardware/Software Runtime
Environment for FPGA-Based Reconfigurable Computers using BORPH,” in Proc. Int.
Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS’06), pp. 259–264,
Oct. 2006.
[44] W. R. Davis, N. Zhang, K. Camera, D. Markovic, T. Smilkstein, M. J. Ammer, E. Yeo, S.
Augsburger, B. Nikolic, and R. W. Brodersen, “A design environment for high
throughput, low power dedicated signal processing systems,” IEEE Journal of
Solid-State Circuits, vol. 37, no. 3, pp. 420–431, Mar. 2002.
[45] B. Hassibi, “An Efficient Square-root Algorithm for BLAST,” in Proc. Int. Conf.
Acoustics, Speech, and Signal Processing (ICASSP’00), vol. 2, pp. 737–740, June 2000.
[46] L. M. Davis, “Scaled and Decoupled Cholesky and QR Decompositions with Application
to Spherical MIMO Detection,” in IEEE Wireless Communications and Networking
Conference (WCNC’03), vol. 1, pp. 326–331, Mar. 2003.
[47] D. Pham, K. R. Pattipati, P. K. Willett, and J. Luo, “An Improved Complex Sphere
Decoder for V-BLAST Systems,” IEEE Signal Processing Letters, vol. 11, no. 9, pp.
748–751, Sep. 2004.
[48] M. O. Damen, A. Chkeif, and J.-C. Belfiore, “Lattice codes decoder for space-time
codes,” IEEE Communications Letters, vol. 4, no. 5, pp. 161–163, May 2000.
[49] M. O. Damen, H. El Gamal, and G. Caire, “On Maximum-Likelihood Detection and the
Search for the Closest Lattice Point,” IEEE Transactions on Information Theory, vol. 49,
no. 10, pp. 2389–2402, Oct. 2003.
[50] XUP: Xilinx University Program, [Online]. Available: http://www.xilinx.com/univ