Massively Parallel Wireless Reconfigurable Processor Architecture and Programming
Konstantinos Sarrigeorgidis, Jan Rabaey
Abstract—We propose a massively parallel reconfigurable processor architecture targeted at the implementation of advanced wireless communication algorithms that feature matrix computations. A design methodology for programming and configuring the processor architecture is developed. The design entry point is the space representation of the algorithm in Simulink. The Simulink description is parsed and the algorithm's Dependence Flow Graph is derived, which is scheduled and space-time mapped onto the proposed architecture. The compiler reconfigures the switchboxes of the proposed hierarchical interconnection network in the architecture. An energy consumption model is derived, and design examples are provided that demonstrate the enhanced energy efficiency of the proposed architecture compared to a state-of-the-art programmable DSP.
I. Introduction
The field of wireless communications is growing at an explosive rate, stimulated by emerging applications such as 3G wireless systems and mobile Internet telephony. The basic trends in wireless communications are ubiquity and more bits/sec. New standards for wireless communications are being developed that boost data rates from 384 kbits/sec up to 2 Mbits/sec. While most of the volume of wireless traffic is due to voice communications, wireless communications are now moving towards data traffic.
The need for higher bit rates has both a direct and an indirect impact on algorithmic complexity. Increasing the data rate (bits/sec) directly increases the computational requirements of the algorithm. Yet, achieving a higher bit rate can only be accomplished through enhanced spectral bandwidth efficiency and wireless channel capacity. In order to meet these expectations, innovations in communication techniques are required. It is well known that the current spectrum efficiency is well below the theoretical limit.
Capacity and spectrum efficiency enhancement call for the design of advanced communication algorithms and techniques (Figure 1). Adaptive beamforming will be used for a very fast wireless connection to the Internet that follows the user everywhere. Multi-megabit wireless access may be employed as an alternative to DSL systems. Multiple-transmit, multiple-receive antenna systems (e.g., the Bell Labs BLAST [1] algorithm) promise a linear increase (with the number of antennas) in the capacity of cellular communications. Multiuser detection and interference cancellation in CDMA systems can boost cellular capacity by more than 30 percent. These techniques can provide dramatic increases in spectral efficiency, while opening the door for a peaceful co-existence of uncoordinated wireless services in the same band without interference.
[Figure 1 illustrates future wireless applications: i-Burst smart antennas for fixed wireless broadband access, beamforming for high-speed mobile Internet access, multi-transmit multiple-receive antennas (BLAST) in a rich scattering environment, and CDMA interference suppression, with computational loads on the order of 10 GOPS.]

Fig. 1. Future Wireless Communication Applications

Most of the advanced algorithms employed for implementing these communication techniques originate from the array signal processing community and are used in array radar systems. A key characteristic of these algorithms is that they consist of advanced matrix computations. Some of the algebraic matrix problems that find application in the design of these advanced adaptive wireless algorithms are: QR decomposition, LU decomposition, least-squares estimation, Householder transformations, singular value decomposition, matrix inversion, linear system solving, FFT, and others.

Advances in wireless information systems as envisioned
by the 3G and beyond will put tremendous pressure on the underlying implementation technologies. Engineers will be faced with the challenge of implementing these complicated algorithms in a single chip under strict power dissipation constraints. The growth in computational complexity over the generations of wireless systems is faster than the growth in the computational capabilities of silicon, as predicted by Moore's law (Figure 2). Yet, single-chip DSPs are lagging behind the processor performance enhancement offered by Moore's law. Our focus is the exploration of novel low-power processor architectures capable of delivering computational power on the order of GOPS in an energy-efficient way.

Traditional processors have a hard time delivering the
requested 3G requirements at the right energy level. A variety of architectural alternatives in the energy efficiency/flexibility space is shown in Figure 3 [7]. The impact of providing full flexibility (in the style of a von Neumann machine) results in three orders of magnitude of "inefficiency" between a DSP and a fixed custom implementation. Novel implementation architectures, based on reconfiguration and parallel/space programming, provide performance
Fig. 2. Algorithmic Complexity
Fig. 3. Energy efficiency versus flexibility
and flexibility at energy levels close to ASIC, as shown in Figure 3. We propose a massively parallel reconfigurable array processor architecture for meeting the computational requirements of 3G and beyond wireless communications, as illustrated in Figure 4.

The implementation of a communication algorithm begins with a mathematical description of the computations involved and concludes with a hardware realization of the algorithm. The traditional programming or "multiplexing-in-time" concept advocated in the instruction-set processor results in an inefficient way to map the algorithm to the specific hardware platform. Instead, we propose a "programming in space" approach, where the algorithm is represented in parallel in space. By employing this approach, all the available parallelism inherent in the communication algorithms is revealed. We propose a design flow, where an algorithm represented in Simulink is compiled to a recon-
[Figure 4 compares advanced communication processors in energy efficiency: a TI C55x VLIW DSP (5 processing nodes, 0.5-5 MIPS/mW), the Maia reconfigurable processor (4 processing nodes with MAC/ALU, 10-100 MOPS/mW), dedicated hardware (100-1000 MOPS/mW), and the proposed massively parallel reconfigurable array processing with 100 processing nodes and more, delivering tens of GOPS. The gap between DSP and dedicated hardware spans three orders of magnitude in energy efficiency.]
Fig. 4. Advanced Communication Processors
[Figure 5 illustrates the flow from advanced wireless algorithms (channel estimation and tracking, MIMO antennas, CDMA multiuser detection, smart antenna beamforming, equalization) through problem formulation to adaptive algorithms (Kalman filter, LMS, RLS, QR-RLS, SVD, least squares) and down to matrix computations (QR decomposition, Givens rotations, Householder reflections, matrix multiplications, inner and outer vector products).]
Fig. 5. Solving Advanced Communication Algorithms
TABLE I
Square-Root Covariance Filter

\[
\begin{bmatrix} 1 & u^H(n)K^{1/2}(n-1) \\ 0 & \lambda^{-1/2}K^{1/2}(n-1) \end{bmatrix}\Theta(n) =
\begin{bmatrix} r^{1/2}(n) & 0^T \\ g(n)r^{1/2}(n) & K^{1/2}(n) \end{bmatrix}
\]
\[
g(n) = \bigl(g(n)r^{1/2}(n)\bigr)\bigl(r^{1/2}(n)\bigr)^{-1}
\]
\[
a(n) = y(n) - u^H(n)\,x(n \mid \Upsilon_{n-1})
\]
\[
x(n+1 \mid \Upsilon_n) = \lambda^{-1/2}\,x(n \mid \Upsilon_{n-1}) + g(n)\,a(n)
\]
figurable array processor architecture, while performing a number of transformations to the algorithm's dependence flow graph.
II. Advanced Wireless Communications Algorithms
In order to illustrate our concept, we now provide some examples of advanced wireless communication algorithms. Referring to Figure 5, we claim that these communication algorithms can be formulated and solved under a common framework by using well-known and studied techniques in adaptive filter theory, such as Kalman filtering, the Least Mean Squares (LMS) algorithm, the Recursive Least Squares (RLS) algorithm, QR-decomposition-based RLS, Singular Value Decomposition (SVD), least-squares methods, and other techniques. Most of these adaptive filtering algorithms consist of recursions on a set of matrix computations, e.g., QR decomposition, Householder triangularization, Givens rotations, matrix multiplications, Householder reflections, and inner and outer vector products. The basic idea is illustrated in Figure 5.

Many wireless communication problems, e.g., equalization, channel estimation, and MIMO channel tracking, can be formulated as stochastic least-squares problems, and therefore can be solved by employing Kalman filtering [3]. A square-root implementation of Kalman filtering is shown in Table I. Adaptive nulling and beamforming [2], CDMA multiuser detection and interference cancellation [4], multiple receive/transmit antenna nulling and cancellation (BLAST), and equalization can be formulated as a deterministic least-mean-squares problem, and solved by the recur-
TABLE II
The QR Recursive Least Squares Algorithm

Initialize: \( \Phi^{1/2}(0) = 0, \; p(0) = 0 \)

For \( n = 1, 2, \ldots \):
\[
\begin{bmatrix} \lambda^{1/2}\Phi^{1/2}(n-1) & u(n) \\ \lambda^{1/2}p^{*}(n-1) & d(n) \\ 0^T & 1 \end{bmatrix}\Theta(n) =
\begin{bmatrix} \Phi^{1/2}(n) & 0 \\ p^{*}(n) & \xi(n)\gamma^{1/2}(n) \\ u^{*}(n)\Phi^{-*/2}(n) & \gamma^{1/2}(n) \end{bmatrix}
\]
\[
w^{*}(n) = p^{*}(n)\Phi^{-1/2}(n)
\]
TABLE III
Square-Root BLAST Algorithm

Initialize: \( P_0^{1/2} = \frac{1}{\sqrt{a}} I_{M \times M}, \; B_0 = 0_{N \times M} \)

For \( i = 1, 2, \ldots \):
\[
\begin{bmatrix} 1 & H_i P_{i-1}^{1/2} \\ 0 & P_{i-1}^{1/2} \\ -e_i & B_{i-1} \end{bmatrix}\Theta_i =
\begin{bmatrix} r_i^{1/2} & 0 \\ K_{p,i} & P_i^{1/2} \\ A_i & B_i \end{bmatrix}
\]
\[
P^{1/2} = P_N^{1/2}, \qquad Q_a = B_N.
\]
\[
P^{1/2}\Sigma = \begin{bmatrix} P^{(M-1)/2} & P_M^{(M-1)/2} \\ 0 & p_M^{1/2} \end{bmatrix}, \qquad Q_a \leftarrow Q_a\Sigma.
\]
sive least squares algorithm (RLS) [3]. One form of the RLS algorithm, based on QR decomposition, is shown in Table II. A square-root implementation [1] of the BLAST algorithm is described in Table III. Singular value decomposition (SVD) is a very useful tool for space processing in MIMO wireless systems. Channel estimation and tracking in multiple receive/transmit antenna systems can be solved by updating the singular value decomposition of the data matrix. An algorithm for updating the SVD of a data matrix [5] is shown in Table IV.
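The adaptive algorithms named above share a common recursive update structure. As a toy illustration of that structure (not part of the paper's design flow; the filter length, step size, and signals are invented for the example), the LMS recursion w(n+1) = w(n) + mu*e(n)*u(n) identifying an unknown 2-tap FIR channel can be sketched as:

```python
# Illustrative LMS adaptive filter: identify an unknown 2-tap FIR system
# from input/desired pairs. All parameter values are invented.
import random

def lms_identify(x, d, num_taps=2, mu=0.05):
    w = [0.0] * num_taps
    for n in range(num_taps - 1, len(x)):
        u = [x[n - k] for k in range(num_taps)]          # regressor u(n)
        y = sum(wi * ui for wi, ui in zip(w, u))         # filter output
        e = d[n] - y                                     # a priori error
        w = [wi + mu * e * ui for wi, ui in zip(w, u)]   # LMS update
    return w

random.seed(0)
w_true = [0.7, -0.3]
x = [random.uniform(-1, 1) for _ in range(2000)]
d = [0.0] * len(x)
for n in range(1, len(x)):
    d[n] = w_true[0] * x[n] + w_true[1] * x[n - 1]

w = lms_identify(x, d)
print([round(wi, 2) for wi in w])
```

With a noiseless desired signal the weights converge to the true channel taps; RLS and its QR-based square-root forms in Tables I-III trade higher per-sample cost for much faster convergence.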
III. Space Programming of Wireless Algorithms
Mapping a communications algorithm to a parallel reconfigurable hardware platform starts with a space representation of the algorithm, describing the actual space-time activities in the form of a Dependence Flow Graph
TABLE IV
The FST Subspace Tracking Algorithm

Householder transformation:
\[
x_s = V_s^{*} x, \qquad z = x - V_s x_s, \qquad v_n = z / \|z\|
\]
QR update (Givens rotations):
\[
\sqrt{\lambda}\,R \rightarrow R, \qquad
Q \begin{bmatrix} R & 0 & x_s^{*} \\ 0 & \sigma_n & \|z\| \end{bmatrix} \rightarrow \begin{bmatrix} R & s \\ 0 & s \end{bmatrix}
\]
QR update:
\[
\begin{bmatrix} R & s \\ 0 & s \end{bmatrix} Q_a \rightarrow \begin{bmatrix} R & 0 \\ s^T & s \end{bmatrix}
\]
QR update:
\[
Q_\beta \begin{bmatrix} R & 0 \\ s^T & s \end{bmatrix} \rightarrow \begin{bmatrix} R & s \\ 0 & \sigma \end{bmatrix}
\]
Matrix-Givens product:
\[
[\,V_s \mid v_n\,]\, Q_a \rightarrow [\,V_s \;\; v_n\,]
\]
[Figure 6 shows an example MATLAB algorithm (initialized with R = triag(n), V = eye(n,n), x = ones(n,1), and containing statements such as w = V'*V*r, R*w = Q*R, and x = R\x) being transformed by the MDL Dependence Graph Generator into a Simulink MDL dependence graph model file.]
Fig. 6. Programming algorithms in space
[8]. Simulink is an excellent environment for "system-level" capture, design, and simulation of wireless communication algorithms, and constitutes the design entry point for the "algorithm to reconfigurable architecture" design flow.

The way we currently program algorithms in space inside the Simulink environment is basically primitive. The designer has to build up the whole application from scratch by choosing primitive hardware blocks (adders, registers, multipliers, etc.) from the Simulink library. This approach is extremely time consuming and inefficient, considering the algorithms we study. We propose a method to automate this design process by providing a dependence graph generator library of matrix computations [10]. Each block inside this library is a masked Simulink block that executes a MATLAB script that generates the Dependence Flow Graph of a matrix computation. The concept of this approach is illustrated in Figure 6, through an example that transforms the MATLAB code of an algorithm into its parallel dependence flow graph.

The MATLAB (.m) dependence graph generators are based on built-in MATLAB commands that allow the user to import blocks into a Simulink model by writing code. Examples of these commands are add_block and add_line [10]. An example of a matrix computation dependence flow graph is shown in Figure 7.
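The generator idea can be sketched in a language-neutral way: instead of emitting Simulink blocks via add_block/add_line, the sketch below builds the same kind of dependence flow graph as a node/edge list for an n x n matrix-vector product. The node naming and graph representation are illustrative, not the paper's.

```python
# Sketch of a dependence-graph generator for y = A*x: one MAC node per
# coefficient, with an accumulation chain along each row (local edges).

def matvec_dfg(n):
    """Build the dependence flow graph of an n x n matrix-vector product."""
    nodes, edges = [], []
    for i in range(n):
        for j in range(n):
            nodes.append(f"MAC_{i}_{j}")          # computes acc + A[i][j]*x[j]
            if j > 0:
                # accumulation chain along row i (local connection)
                edges.append((f"MAC_{i}_{j-1}", f"MAC_{i}_{j}"))
        nodes.append(f"OUT_{i}")
        edges.append((f"MAC_{i}_{n-1}", f"OUT_{i}"))  # row result y[i]
    return nodes, edges

nodes, edges = matvec_dfg(4)
print(len(nodes), len(edges))  # 20 nodes, 16 edges for the 4x4 case
```

A MATLAB generator would emit one add_block call per node and one add_line call per edge of exactly such a graph.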
IV. Massively Parallel Reconfigurable Architecture
A. Architecture Model and Interconnection Network
In this section we derive the proposed architecture model by elaborating on the problem of mapping computational graphs to processor arrays. The problem of mapping matrix computation dependence graphs to array architectures has been addressed by the systolic array community [6]. A design methodology has been developed that transforms a matrix computation to a fixed array structure. The problem of mapping advanced wireless communication algorithms to an array processor boils down to transforming a set of concurrent matrix computations to a reconfigurable array
Fig. 7. Dependence flow graph of \( A^T A \cdot b \)
structure.
The dependence flow graphs of the communication algorithms we study consist of many interconnected dependence graphs of different matrix computations in parallel. To demonstrate our concept, consider the singular value decomposition update algorithm described in Table IV. A Simulink block diagram of this algorithm is shown in Figure 8. The dependence graph of this algorithm consists of a set of matrix computation dependence graphs, such as matrix-vector multiplication, QR update, and Givens rotations. The connections inside the dependence graph of each matrix computation are considered local, while the connections between the dependence graphs of different matrix computations are considered global. An energy-efficient space mapping should target minimizing the global wire transfers, "transforming" the global connections into local ones.
By visualizing the set of concurrent matrix computations in a three-dimensional plane, we arrive at the concept of an overlay of concurrent matrix computations, as demonstrated in Figure 9. We can transform the global connections shown in Figure 8 (corresponding to the dependence flow graph domain) into local connections (corresponding to the hardware processing units domain) by grouping one computational task from each of the computations in the overlay, forming a tightly connected SuperNode processor that consists of the processing units executing the set of computational tasks grouped together. This is illustrated in Figure 9. The global connections between the computational tasks in the original dependence graph have now been transformed into local IntraSuperNode connections between the processing units inside the SuperNode. The local connections inside the dependence graph of each matrix computation in Figure 8 are now transformed into local InterSuperNode connections. We refer to the resulting interconnection network model as a hierarchical overlay of meshes; it is illustrated in Figure 9.
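The grouping step above can be sketched with a toy overlay model (task labels and edge lists are invented for illustration): tasks at the same grid position in each overlaid computation are fused into one SuperNode, so edges between different computations at the same position become IntraSuperNode, while edges within one computation cross SuperNodes.

```python
# Toy overlay model: a task is (computation_id, position); grouping by
# position forms SuperNodes. Classify each dependence edge accordingly.

def classify_edges(edges):
    """edges: list of ((comp, pos), (comp, pos)) dependences."""
    intra, inter = 0, 0
    for (c1, p1), (c2, p2) in edges:
        if p1 == p2:
            intra += 1     # same SuperNode after grouping (was global)
        else:
            inter += 1     # crosses SuperNodes (was a local DFG edge)
    return intra, inter

# Two overlaid computations (0: matrix-vector, 1: QR update) on positions 0..2.
edges = [
    ((0, 0), (0, 1)), ((0, 1), (0, 2)),   # local edges inside computation 0
    ((1, 0), (1, 1)), ((1, 1), (1, 2)),   # local edges inside computation 1
    ((0, 0), (1, 0)), ((0, 2), (1, 2)),   # global edges between computations
]
intra, inter = classify_edges(edges)
print(intra, inter)  # 2 intra-SuperNode, 4 inter-SuperNode
```

The formerly global inter-computation edges become short IntraSuperNode hops, which is exactly the wire-length reduction the overlay targets.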
It may be the case that the set of concurrent matrix computations exceeds four, the number in our example. In such a scenario we employ a more powerful SuperNode to boost the computing power of the array. In order to model the SuperNode architecture we introduce the Processing Unit Bank concept, which is a cluster
[Figure 8 shows the Simulink block diagram of the SVD update: matrix-vector, transpose, QR update, and Givens rotation blocks, with local connections inside each matrix computation and global connections between them.]
Fig. 8. Space mapping problem of SVD
[Figure 9 depicts a SuperNode built from mul, r/cordic, v/cordic, and alu units, formed by overlaying parallel matrix computations (Givens matrix, QR update, matrix multiplication).]
Fig. 9. SuperNode and Overlay of Parallel Computations
of processing elements of the same arithmetic operation. Each SuperNode consists of four processing unit banks: the MUL/ADD processor bank, the R/CORDIC (rotate) and V/CORDIC (vector) processor banks, and the ALU processor bank. For example, a SuperNode of type [2,2,1,1] consists of 2 MUL/ADD, 2 R/CORDIC, 1 V/CORDIC, and 1 ALU units. The final architecture is a heterogeneous array of SuperNodes, connected with an overlay-of-meshes interconnection network, as illustrated in Figure 10. The architecture of the SuperNode is also shown. We refer to the interconnection network inside the SuperNodes as the IntraSuperNode or IntraCluster network, and the interconnection network between the SuperNodes as the InterSuperNode or InterCluster network. In a scenario where the number of parallel computations exceeds the number of available processing elements inside the SuperNode, hardware multiplexing or folding is applied, as explained later.
B. Energy Model
The InterCluster network consists of an array of mesh networks, where each output port of a SuperNode (cluster) routes information along its own mesh network, thus providing interconnection network parallelism, as illustrated in Figure 11. The IntraCluster network is shown in Figure 12 and consists of a mesh network along with a number of multiplexers, one multiplexer for each pro-
[Figure 10 shows a heterogeneous array of SuperNodes of various types (e.g., [2 2 2 2], [2 1 2 1], [3 1 1 1]); each cluster/SuperNode contains Mul, R, V, and Alu units, switches, and a buffer.]
Fig. 10. Heterogeneous array of SuperNodes
cessing unit input port. The multiplexers select the appropriate signal among all possible processing unit output ports from the same cluster and the signal coming from an external cluster (the In input signals).

For example, assume we need to route the following connections in parallel: the output port of the multiplier in Cluster23 has to be routed to the (b) port of the ALU processing unit in Cluster11, and in parallel the (r) port of the R/CORDIC processing unit in Cluster13 has to be routed to the (z) port of the MUL processing unit in Cluster21. The configuration of the InterCluster and IntraCluster switches is shown in Figures 11 and 12, respectively.

The RC model of the InterCluster and IntraCluster interconnect routing is depicted in Figure 13. The capacitance Cd and resistance Ron denote the diffusion capacitance and the resistance of the pass-transistor switch, respectively. The quantities Rlw and Clw denote the resistance and capacitance of the IntraCluster (local) wire, while the quantities Rgw and Cgw denote the resistance and capacitance of the InterCluster (global) wire. Denote by Nghops and Nlhops the number of hops inside the InterCluster and IntraCluster networks, respectively, and by b the datapath width (e.g., 16 bits). Based on Figure 13, the total energy dissipated on the interconnection network is given by
\[
\begin{aligned}
E_{IN} &= N_g E_{global} + N_l E_{local} \\
       &= N_g \bigl[(7C_d + C_{lw}) + N_{ghops}(6C_d + C_{gw}) + N_{lhops}(6C_d + C_{lw}) + 14C_d\bigr] V \cdot V_{swing}\, b \\
       &\quad + N_l \bigl[(14C_d + C_{lw})\, V \cdot V_{swing}\, b\bigr] \qquad (1)
\end{aligned}
\]
The total energy is given by E_TOT = E_PU + E_MEM + E_IN, where the first term is contributed by the processing elements, the second by data buffering, and the third by the interconnection network. An Elmore RC delay model is developed in [10].
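A direct transcription of the interconnect energy model of Eq. 1 can be sketched as follows; all device values (Cd, Clw, Cgw, V, Vswing) and hop counts in the example call are placeholders chosen purely for illustration, not the paper's numbers.

```python
# Sketch of the interconnect energy model of Eq. 1. All parameter values
# in the example are illustrative placeholders.

def interconnect_energy(Ng, Nl, Nghops, Nlhops,
                        Cd, Clw, Cgw, V, Vswing, b):
    """Return E_IN = Ng*E_global + Nl*E_local per Eq. 1 (joules)."""
    e_global = ((7 * Cd + Clw)
                + Nghops * (6 * Cd + Cgw)    # inter-cluster mesh hops
                + Nlhops * (6 * Cd + Clw)    # intra-cluster hops on the path
                + 14 * Cd) * V * Vswing * b
    e_local = (14 * Cd + Clw) * V * Vswing * b
    return Ng * e_global + Nl * e_local

# Example: 3 global transfers (2 inter-cluster and 1 intra-cluster hop each),
# 10 local transfers, 16-bit datapath.
E = interconnect_energy(Ng=3, Nl=10, Nghops=2, Nlhops=1,
                        Cd=2e-15, Clw=20e-15, Cgw=200e-15,
                        V=1.0, Vswing=1.0, b=16)
print(f"{E * 1e12:.2f} pJ")
```

The compiler supplies Ng, Nl, and the hop counts from the mapped graph, which is how the Table V breakdown later in the paper is obtained.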
V. Simulink to Reconfigurable Model Compiler
An overview of the proposed design methodology is shown in Figure 14. The first step of the design flow is illustrated in Figure 6. The dependence flow graph that represents the algorithm in parallel, as well as the architecture model, constitute the input to the compiler. The output of the compile process is an executable reconfigured architecture Simulink model. Each computational node in the
Fig. 11. Overlay of Meshes InterCluster Network
Fig. 12. [1111] IntraCluster Network
dependence graph is modeled as a C-MEX (MATLAB-executable) S-function that is linked to our fixed-point matrix computation library [10]. Each functional block inside that library is a masked S-function. The mask parameters specify the hardware characteristics of the processing nodes, such as fixed-point data type, number of pipeline stages, number of CORDIC iterations, and others.
In order to process the dependence graph of an algorithm represented as a Simulink model, we need to transform the Simulink description (.mdl file) into a C++ data structure. Our compiler transforms the .mdl file into a C++ directed graph. The program is written in C++ and consists of a lexical analyzer, mdllexer.l (using Flex), and a parser, mdlparser.y (using Bison). The output of the parsing process is a C++ graph that represents the dependence flow graph of the algorithm. A Simulink .mdl file is a hierarchical, tree-structured ASCII file that contains keywords and parameter-value pairs that describe the model. The mdl2dfg program builds this hierarchical tree as a C++ list of linked lists after the parsing process. This tree is walked such that all the dependency edges, and the delays associated with each edge, are extracted.
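The keyword/parameter-value tree structure described above can be sketched with a toy recursive-descent parser (a simplification: real .mdl files also have quoted strings and array syntax that this does not handle, and the block names below are invented):

```python
# Toy parser for an mdl-style nested "Keyword { param value ... }" format,
# returning a (name, params, children) tree like mdl2dfg's linked-list tree.

def parse_mdl(tokens):
    name = tokens.pop(0)
    assert tokens.pop(0) == "{"
    params, children = {}, []
    while tokens[0] != "}":
        if len(tokens) > 1 and tokens[1] == "{":
            children.append(parse_mdl(tokens))   # nested block
        else:
            key, value = tokens.pop(0), tokens.pop(0)
            params[key] = value                  # parameter-value pair
    tokens.pop(0)  # consume the closing "}"
    return name, params, children

text = "Model { Name svd System { Block { BlockType Gain } Block { BlockType Sum } } }"
tree = parse_mdl(text.split())
name, params, children = tree
print(name, params["Name"], len(children[0][2]))  # Model svd 2
```

Walking such a tree to collect Block/Line entries (and the delay parameter on each line) yields exactly the edge list the compiler needs.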
The computational nodes of the graph are subsequently
Fig. 13. RC equivalent to Inter and Intra Cluster routing
[Figure 14 outlines the design flow: a wireless algorithm in MATLAB enters the dependence graph generator library; the mdl2dfg Flex/Bison parser turns the Simulink .mdl file into a C++ directed graph; scheduling/retiming, mapping (minimizing an energy cost function), folding equations for hardware multiplexing, and manual partitioning/task assignment follow; finally the Reconfigure.m MATLAB script reads configuration data (switches, buffers, processor scheduling) and produces the reconfigured architecture in Simulink, whose masked C-MEX S-function dialog parameters carry the hardware settings (CORDIC rotations, pipeline stages, fixed-point data type, assigned cluster and processing unit).]
Fig. 14. Design flow methodology
scheduled by an as-soon-as-possible (ASAP) scheduling routine described in Section V-A. The dependence graph has to be partitioned to fit the target array architecture, and each computational task in the graph is assigned to a processing unit inside a SuperNode. This mapping process is described in Section V-B.
At this point of the design flow, each object node in the C++ graph has three properties associated with it: the SuperNode that the computational task is assigned to, the processing unit inside that SuperNode that executes the task, and the time step at which the operation takes place (the schedule). The three object fields of each node are input to the hardware multiplexing/folding transformation block, which derives configurations for the switchboxes in the array-of-meshes network and programs the memory buffers of each of the processing units. Finally, the Reconfigure.m MATLAB script reads the configuration data from a file and automatically generates the reconfigured architecture model as an executable Simulink .mdl file. The initial algorithm written in MATLAB and the executable reconfigured architecture should have the same behavioral functionality.
A. Scheduling
The scheduling algorithm employed in our design flow is based on the "as soon as possible" (ASAP) scheduling technique [8]. The original ASAP scheduling assumes unlimited resources; we modify the ASAP scheduler to take the limited number of resources into consideration. Denote the time step at which a particular node n is assigned to be executed on the processing element PEn as time step TSn, and assume the computation takes Cn time steps. Our scheduling algorithm is as follows:
Input: DG = (N, E)
Output: scheduled graph

∀ node n ∈ N : Predecessors_n = ∅ ⇒ TS_n = 0
while (number of unscheduled nodes > 0)
    ∀ node n_j : if (all Predecessors_{n_j} are scheduled)
        TS_j = max(TS_i + C_i) over all predecessors i
        while (∃ node n_k : TS_k = TS_j and PE_j = PE_k)
            TS_j = TS_j + 1
        decrement the number of unscheduled nodes
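The resource-constrained ASAP routine above can be sketched directly in Python (task names, latencies, and processing-element labels are invented for the example; the graph is assumed acyclic):

```python
# Sketch of the resource-constrained ASAP scheduler: a node's start time is
# the ASAP bound after its predecessors, pushed later while its assigned
# processing element is busy at the candidate time step.

def asap_schedule(nodes, edges):
    """nodes: {name: (pe, latency)}; edges: (pred, succ) pairs (acyclic)."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    ts = {}
    while len(ts) < len(nodes):
        for n, (pe, _lat) in nodes.items():
            if n in ts or not all(p in ts for p in preds[n]):
                continue
            # ASAP bound: earliest step after all predecessors complete
            t = max((ts[p] + nodes[p][1] for p in preds[n]), default=0)
            busy = {ts[m] for m in ts if nodes[m][0] == pe}
            while t in busy:        # limited resources: push back on conflict
                t += 1
            ts[n] = t
    return ts

# Two MUL tasks contend for the same unit; the ALU task waits for both.
nodes = {"m1": ("MUL", 1), "m2": ("MUL", 1), "a1": ("ALU", 1)}
edges = [("m1", "a1"), ("m2", "a1")]
schedule = asap_schedule(nodes, edges)
print(schedule)
```

With one MUL unit, m2 is deferred by one step, and the dependent ALU task starts only after the later multiplier finishes, exactly the conflict-resolution loop of the pseudocode.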
B. Mapping
The problem of assigning the tasks of a larger computation graph to a parallel computer of smaller size is referred to as algorithm partitioning [6]. This problem is also known in the systolic design community as space-time mapping of computations onto a fixed-size array processor.

The processor allocation problem involves partitioning the dependence graph into many blocks, each consisting of a cluster of nodes in the dependence graph. The two best-known methods of partitioning a dependence graph and assigning computational tasks to processing elements are the locally sequential globally parallel (LSGP) method and the locally parallel globally sequential (LPGS) method [6]. As demonstrated in [10], the LPGS scheme, which is better tailored to matrix computations, suffers from extensive global connectivity (energy dissipation) each time the intermediate block results are loaded into the array for processing the next block. The interleaved LPGS space mapping is proposed in [10]; it requires only local connections between the processing elements, at the expense of reconfiguring the interconnection network from one processed block to the next, as illustrated in Figure 15.

These mapping techniques originate from the systolic array community, which tries to map a matrix computation to a homogeneous fixed array processor. Mapping wireless communications to reconfigurable architectures requires the transformation of a set of concurrent matrix computations to an array of heterogeneous SuperNodes. We propose two ways to solve this problem. The first technique applies the interleaved LPGS partitioning technique to each of the matrix computations involved, which maps the computation onto the SuperNode array as well as onto the processing unit responsible for executing the particular type of matrix computation (e.g., R/CORDICs for Givens rotations). This process is repeated for all the computations, until all
[Figure 15 illustrates the proposed interleaved cut-and-pile partitioning: a 3x3 block of processors P0-P8 is reused over time steps t = 0..3, permuting rows and/or columns and reconfiguring connections to change the processing direction along the i and j axes.]
Fig. 15. Proposed Interleaved LPGS
the processing units in a SuperNode have been assigned a task to execute. Subsequently, hardware multiplexing/folding is applied, as explained in the next subsection. The second technique maps the entire dependence flow graph of the algorithm to the SuperNode array and to the processing units by minimizing a cost function. Since our cost metric is energy efficiency, this mapping method tries to minimize the interconnection equivalent distance, defined as ID = α · N_inter + β · N_intra, where N_inter and N_intra are the total numbers of hops in the InterSuperNode and IntraSuperNode mesh networks, and α, β are weight factors.
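The cost-driven mapping can be sketched as a brute-force search over task-to-SuperNode placements that minimizes ID (with several assumptions not in the paper: hop counts are taken as Manhattan distances on the SuperNode grid, and the weights, task names, and grid size are invented):

```python
# Sketch of the ID-minimizing mapping: enumerate candidate task placements
# and keep the one minimizing ID = alpha*N_inter + beta*N_intra.
from itertools import product

def interconnect_distance(mapping, edges, alpha=4.0, beta=1.0):
    n_inter = n_intra = 0
    for u, v in edges:
        (xu, yu), (xv, yv) = mapping[u], mapping[v]
        hops = abs(xu - xv) + abs(yu - yv)   # Manhattan hops on the mesh
        if hops == 0:
            n_intra += 1                     # same SuperNode: intra hop
        else:
            n_inter += hops                  # inter-SuperNode mesh hops
    return alpha * n_inter + beta * n_intra

def best_mapping(tasks, edges, grid):
    best = None
    for placement in product(grid, repeat=len(tasks)):
        m = dict(zip(tasks, placement))
        cost = interconnect_distance(m, edges)
        if best is None or cost < best[1]:
            best = (m, cost)
    return best

tasks = ["matvec", "qr", "givens"]
edges = [("matvec", "qr"), ("qr", "givens")]
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
mapping, cost = best_mapping(tasks, edges, grid)
print(cost)  # tightly coupled tasks end up co-located
```

Because α exceeds β, the search co-locates communicating tasks, mirroring the paper's goal of turning expensive global transfers into cheap intra-SuperNode ones; a practical compiler would use a heuristic rather than exhaustive enumeration.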
C. Folding Transformation
Systematic design of hardware multiplexing is achieved through the folding transformation [8]. This transformation derives control circuits for DSP architectures in which multiple arithmetic operations are assigned to be executed on a single functional unit. Folding is used to fold the fully parallel space representation of an algorithm (the DG) into the reconfigurable fixed-size parallel processor. The input to the folding transformation is the scheduled and space-mapped dependence graph of the algorithm. The output of the folding transformation re-schedules (in case we apply retiming) the processing elements in the SuperNode, configures the switches in the interconnection network, and programs the memory buffers of each of the processing elements.

After the folding transformation, the original dependence graph (a graph of computational tasks) is transformed into a folded graph that represents the actual processing units in the architecture, rather than the computational tasks as in the original graph. The vertices of the folded graph are the available processing elements inside the clusters, and the edges represent the number of delays for which the result from the source processing element must be stored before it is used by the destination processing element.

For a folded system to be realizable, the folded data path delays [8] must be nonnegative. In case the folding transformation is not realizable, we apply retiming to change the delays in the original graph. This retiming problem can be reformulated as a shortest-path problem and solved by employing the Bellman-Ford algorithm. If a solution
TABLE V
Energy breakdown for the biquad filter
Architecture Module        Energy (pJ)
Interconnection Network    44
Processing Units           100
Memory                     84
Total                      228
exists, then the retiming value r(X) is the shortest path from an arbitrary node O to node X.
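The retiming step above can be sketched with a standard Bellman-Ford solver over the constraint graph: each constraint r(V) − r(U) ≤ w becomes an edge (U, V, w), a source O with zero-weight edges to all nodes is added, and r(X) is the shortest path from O to X. The graph data below is illustrative, not taken from the paper.

```python
# Minimal Bellman-Ford sketch for the retiming step. Edge (u, v, w) encodes
# the constraint r(v) - r(u) <= w; the graph here is a made-up example.

def bellman_ford(nodes, edges, source):
    """Shortest-path distances from source, or None if a negative cycle
    exists (i.e. the retiming constraints are infeasible)."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):          # relax all edges |V|-1 times
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:                    # one extra pass detects negative cycles
        if dist[u] + w < dist[v]:
            return None
    return dist

nodes = ["O", "A", "B", "C"]
# Source O reaches every node with weight 0, as in the constraint-graph method.
edges = [("O", "A", 0), ("O", "B", 0), ("O", "C", 0),
         ("A", "B", 1), ("B", "C", -1), ("A", "C", 2)]
r = bellman_ford(nodes, edges, "O")
print(r)  # {'O': 0, 'A': 0, 'B': 0, 'C': -1}
```

If `bellman_ford` returns `None`, no retiming satisfies the folding constraints and the schedule must be revised.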
VI. Design Examples
We provide two design examples in which the algorithms are first described in Simulink and our compiler maps them onto the reconfigured architecture. The first example is a biquad filter, and the second is the SVD update algorithm. The energy efficiency of our architecture is compared with that of one of the state-of-the-art Texas Instruments DSPs.
A. Biquad Filter
The biquad filter (Figure 6.3 in [8]) is manually mapped to the processing elements in the architecture, which consists of one Cluster. The filter is mapped to two processing elements: one MUL unit and one ALU unit. The MUL hardware module has two pipeline stages and the ALU module one pipeline stage. The Simulink model of the filter is parsed, and the algorithm's graph is scheduled and retimed so that it can be folded and executed by the two processing modules.
Our compiler also computes the number of interconnections that need to be routed between different processing elements, the number of times each processing unit is scheduled during one iteration of the algorithm, and the number of memory buffer operations for one execution of the algorithm. Substituting these numbers into the energy dissipation model in Eq. 1, we obtain the energy estimation breakdown shown in Table V.
The energy efficiency of the proposed architecture while executing the biquad filter is Energyeff = OP/Energy = 8 OP / 228 pJ ≈ 35 MOPS/mW.
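As a quick sanity check on this figure, the few lines below reproduce it from the Table V breakdown and the operation count quoted in the text (the unit conversion is the only added step).

```python
# Sanity check on the biquad numbers: the Table V breakdown and the
# 8 operations per filter iteration quoted in the text give ~35 MOPS/mW.

energy_pj = {"interconnect": 44, "processing": 100, "memory": 84}
total_pj = sum(energy_pj.values())       # 228 pJ per filter iteration
ops = 8                                  # operations per biquad iteration

# 1 OP/pJ = 10^12 OPS/W = 10^3 MOPS/mW
eff_mops_per_mw = ops / total_pj * 1e3
print(total_pj, round(eff_mops_per_mw))  # 228 35
```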
The power consumption of the Texas Instruments TMS320C64xx processor family is now discussed. The following table shows the power consumption of the TMS320C641 processor at 50% high / 50% low activity.
Processor        Baseline (W)   Activity (W)   Total (W)
600 MHz/1.4V     0.25           0.17           0.42
500 MHz/1.2V     0.15           0.11           0.26
The biquad filter was described in C using the Texas Instruments Code Composer Studio, and the code was compiled and run on the TMS320C641 processor. The number of clock cycles to execute the biquad filter was found to be 72. Therefore, the energy efficiency of the TMS320C641
Fig. 16. SVD update Simulink representation
TABLE VI
Energy breakdown for the SVD algorithm
Architecture Module    Energy (pJ)
Interconnection        1660
Processing Units       1580
Memory                 875
Total                  4115
processor while executing the biquad filter is Energyeff = OP/Energy = 8 OP / (0.26 W · 72/(500·10^6) s) ≈ 0.25 MOPS/mW.
B. SVD Update
The Simulink representation of the SVD update algorithm is shown in Figure 16. The Simulink model is compiled to a dependence graph, which is scheduled, mapped, and folded onto a 4-by-4 array cluster. An estimate of the energy dissipation contributed by each module is shown in Table VI. The histograms of the interconnection routing requirements are shown in Figure 17.

The worst-case delay across the interconnection network was found to be delay ≈ 9.15·10^-8 sec, which means that the architecture can be clocked as fast as 10 MHz. The power consumption is P = 4115 pJ / (N·delay) ≈ 20 mW, where N is the folding factor, which was 2 in this example. The 4×4 array cluster is capable of delivering

((A·A^T·b)op + (a−b)op + 2·QRop + (V·Q)op) / (N·delay) = 72 / (18.3·10^-8) ≈ 400 MOPS

while executing the SVD algorithm. The energy efficiency
[Figure 17 panels: histograms of routing distances, number of occurrences vs. InterCluster number of hops and IntraCluster number of hops]
Fig. 17. Distribution of distance travelled across interconnection network
TABLE VII
Energy Efficiency Comparison
Energy Efficiency (MOPS/mW)

Architecture                         Biquad Filter   SVD
TMS320C64xx (0.15µ, 1.2V)            0.25            0.013
Proposed Architecture (0.25µ, 1V)    40              17
Improved Energy Efficiency Factor    160             1300

Pleiades [7] (0.25µ, 1V): 10-80 MOPS/mW (VCELP algorithm)
of our architecture for the SVD algorithm is Energyeff = OP/Energy = 72 OP / 4115 pJ ≈ 17 MOPS/mW. The SVD update algorithm was written in C using the Code Composer Studio in order to simulate the TMS320C641X DSP processor. After compiling and profiling, the total number of clock cycles for the SVD algorithm was found to be N = 11,311. The TMS320C64XX DSP is thus able to deliver 72 / (11311/(500·10^6)) ≈ 3.2 MOPS when executing the SVD algorithm. Its average power consumption is 0.26 W. Therefore its energy efficiency is Energyeff = 3.2 MOPS / 0.26 W ≈ 0.013 MOPS/mW. A comparison between the energy efficiency of the reconfigurable architecture and the TMS320C641 processor is shown in Table VII. We conclude that there is a two to three orders of magnitude difference in energy efficiency between the DSP and the reconfigurable architecture.
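The comparison can be re-derived in a few lines of arithmetic from the counts quoted in the text (72 operations per update, 4115 pJ from Table VI, 11,311 cycles at 500 MHz, and the 0.26 W total from the 500 MHz/1.2 V row of the power table):

```python
# Re-deriving the Table VII comparison from the counts quoted in the text.

# Proposed architecture: 72 operations per SVD update, 4115 pJ (Table VI).
arch_eff = 72 / 4115 * 1e3                 # OP/pJ converted to MOPS/mW

# TMS320C641: 11,311 cycles at 500 MHz, 0.26 W total power.
dsp_mops = 72 / (11311 / 500e6) / 1e6      # delivered throughput in MOPS
dsp_eff = dsp_mops / (0.26 * 1e3)          # MOPS per mW

print(round(arch_eff, 1), round(dsp_mops, 1), round(arch_eff / dsp_eff))
```

The exact ratio lands near the ~1300× improvement factor of Table VII once the quoted roundings (17 MOPS/mW and 0.013 MOPS/mW) are applied.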
VII. Conclusions
We proposed a massively parallel reconfigurable architecture and a compiler that transforms advanced wireless communication algorithms expressed in Simulink into an executable Simulink model representing the reconfigured architecture. We provided examples that demonstrate a two to three orders of magnitude improvement in energy efficiency of the proposed architecture over a state-of-the-art DSP processor.
References
[1] B. Hassibi, "An Efficient Square-Root Algorithm for BLAST", IEEE Conf. on Acoustics, Speech and Signal Processing, 2000, vol. 2, pp. 737-740.
[2] C. M. Rader, "VLSI Systolic Arrays for Adaptive Nulling", IEEE Signal Processing Magazine, July 1996, pp. 29-49.
[3] S. Haykin, Adaptive Filter Theory, Prentice Hall, 1996.
[4] X. Wang, H. V. Poor, "Blind Multiuser Detection: A Subspace Approach", IEEE Trans. on Information Theory, vol. 44, no. 2, March 1998.
[5] D. J. Rabideau, "Fast, Rank Adaptive Subspace Tracking and Applications", IEEE Trans. on Signal Processing, vol. 44, no. 9, September 1996.
[6] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[7] J. Rabaey, "Silicon Platforms for the Next Generation Wireless Systems - What Role Does Reconfigurable Hardware Play?", Proceedings FPL 2000, Austria, August 2000.
[8] K. K. Parhi, VLSI Digital Signal Processing Systems, John Wiley, 1999.
[9] K. Sarrigeorgidis, J. Rabaey, "Ultra Low Power CORDIC Processor for Advanced Wireless Communications Algorithms", submitted to VLSI Signal Processing Systems.
[10] K. Sarrigeorgidis, "Massively Parallel Reconfigurable Wireless Processor Architecture and Programming", Ph.D. Thesis, U.C. Berkeley, Dec. 2002.