Massively Parallel Wireless Reconfigurable Processor Architecture and Programming
Konstantinos Sarrigeorgidis, Jan Rabaey
Abstract—We propose a massively parallel reconfigurable processor architecture targeted at the implementation of advanced wireless communication algorithms that feature matrix computations. A design methodology for programming and configuring the processor architecture is developed. The design entry point is the space representation of the algorithm in Simulink. The Simulink description is parsed and the algorithm's Dependence Flow Graph is derived, which is scheduled and space-time mapped onto the proposed architecture. The compiler reconfigures the switchboxes of the proposed hierarchical interconnection network in the architecture. An energy consumption model is derived, and design examples are provided that demonstrate the enhanced energy efficiency of the proposed architecture compared to a state-of-the-art programmable DSP.
I. Introduction
The field of wireless communications is growing at an explosive rate, stimulated by emerging applications such as 3G wireless systems and mobile Internet telephony. The basic trends in wireless communications are ubiquity and more bits/sec. New standards for wireless communications are being developed that boost data rates from 384 kbits/sec up to 2 Mbits/sec. While most of the volume of wireless traffic is due to voice communications, wireless communications are now moving towards data traffic.
The need for higher bit rates has both a direct and an indirect impact on algorithmic complexity. Increasing the data rate (bits/sec) directly increases the computational requirements of the algorithm. Yet, achieving a higher bit rate can only be accomplished through enhanced spectral bandwidth efficiency and wireless channel capacity. In order to meet these expectations, innovations in communication techniques are required. It is well known that the current spectrum efficiency is well below the theoretical limit.
Capacity and spectrum efficiency enhancement call for the design of advanced communication algorithms and techniques (Figure 1). Adaptive beamforming will be used for a very fast wireless connection to the Internet that follows the user everywhere. Multi-megabit wireless access may be employed as an alternative to DSL systems. Multiple-transmit, multiple-receive antenna systems (e.g., the Bell Labs BLAST [1] algorithm) promise a linear increase (with the number of antennas) in the capacity of cellular communications. Multiuser detection and interference cancellation in CDMA systems can boost cellular capacity by more than 30 percent. These techniques can provide dramatic increases in spectral efficiency, while opening the door for a peaceful co-existence of uncoordinated wireless services in the same band without interference.
[Figure 1 illustrates future wireless applications: i-Burst smart antennas for fixed wireless broadband access, beamforming for high-speed mobile Internet access, multi-transmit multiple-receive antennas (BLAST) in a rich scattering environment, and CDMA interference suppression, with computational loads on the order of 10 GOPS.]

Fig. 1. Future Wireless Communication Applications

Most of the advanced algorithms employed for implementing these communication techniques originate from the array signal processing community and are used in array radar systems. A key characteristic of these algorithms is that they consist of advanced matrix computations. Some of the algebraic matrix problems that find application in the design of these advanced adaptive wireless algorithms are: QR decomposition, LU decomposition, least-squares estimation, Householder transformations, singular value decomposition, matrix inversion, linear system solving, FFT, and others.

Advances in wireless information systems as envisioned
by the 3G and beyond will put tremendous pressure on the underlying implementation technologies. Engineers will be faced with the challenge of implementing these complicated algorithms in a single chip under strict power dissipation constraints. The growth in computational complexity over the generations of wireless systems is faster than the growth in the computational capabilities of silicon, as predicted by Moore's law (Figure 2). Yet, single-chip DSPs are lagging behind the processor performance enhancement offered by Moore's law. Our focus is the exploration of novel low-power processor architectures capable of delivering computational power on the order of GOPS in an energy-efficient way.

Traditional processors have a hard time delivering the
requested 3G requirements at the right energy level. A variety of architectural alternatives in the energy efficiency/flexibility space is shown in Figure 3 [7]. The impact of providing full flexibility (in the style of a von Neumann machine) results in three orders of magnitude of "inefficiency" between a DSP and a fixed custom implementation. Novel implementation architectures, based on reconfiguration and parallel/space programming, provide performance
Fig. 2. Algorithmic Complexity
Fig. 3. Energy efficiency versus flexibility
and flexibility at energy levels close to ASIC, as shown in Figure 3. We propose a massively parallel reconfigurable array processor architecture for meeting the computational requirements of 3G and beyond wireless communications, as illustrated in Figure 4.

The implementation of a communication algorithm begins with a mathematical description of the computations involved and concludes with a hardware realization of the algorithm. The traditional programming or "multiplexing-in-time" concept advocated in the instruction-set processor results in an inefficient way to map the algorithm to the specific hardware platform. Instead, we propose a "programming in space" approach, where the algorithm is represented in parallel in space. By employing this approach, all the available parallelism inherent in the communication algorithms is revealed. We propose a design flow, where an algorithm represented in Simulink is compiled to a recon-
[Figure 4 compares advanced communication processors in energy efficiency: a TI C55x VLIW DSP (5 processing nodes, 0.5-5 MIPS/mW), the Maia reconfigurable processor (4 processing nodes with MAC/ALU, 10-100 MOPS/mW), dedicated hardware (100-1000 MOPS/mW), and the proposed massively parallel reconfigurable array processing with 100 processing nodes and more, delivering tens of GOPS. The gap between DSP and dedicated hardware spans three orders of magnitude in energy efficiency.]
Fig. 4. Advanced Communication Processors
[Figure 5 illustrates the flow from advanced wireless algorithms (channel estimation and tracking, MIMO antennas, CDMA multiuser detection, smart antenna beamforming, equalization) through problem formulation to adaptive algorithms (Kalman filter, LMS, RLS, QR-RLS, SVD, least squares) and down to matrix computations (QR decomposition, Givens rotations, Householder reflections, matrix multiplications, inner and outer vector products).]
Fig. 5. Solving Advanced Communication Algorithms
TABLE I
Square-Root Covariance Filter

\[
\begin{bmatrix} 1 & u^H(n)K^{1/2}(n-1) \\ 0 & \lambda^{-1/2}K^{1/2}(n-1) \end{bmatrix}\Theta(n) =
\begin{bmatrix} r^{1/2}(n) & 0^T \\ g(n)r^{1/2}(n) & K^{1/2}(n) \end{bmatrix}
\]
\[
g(n) = \bigl(g(n)r^{1/2}(n)\bigr)\bigl(r^{1/2}(n)\bigr)^{-1}
\]
\[
a(n) = y(n) - u^H(n)\,x(n \mid \Upsilon_{n-1})
\]
\[
x(n+1 \mid \Upsilon_n) = \lambda^{-1/2}\,x(n \mid \Upsilon_{n-1}) + g(n)\,a(n)
\]
figurable array processor architecture, while performing a number of transformations to the algorithm's dependence flow graph.
II. Advanced Wireless Communications Algorithms
In order to illustrate our concept, we now provide some examples of advanced wireless communication algorithms. Referring to Figure 5, we claim that these communication algorithms can be formulated and solved under a common framework by using well-known and studied techniques in adaptive filter theory, such as Kalman filtering, the Least Mean Squares (LMS) algorithm, the Recursive Least Squares (RLS) algorithm, QR-decomposition-based RLS, Singular Value Decomposition (SVD), least-squares methods, and other techniques. Most of these adaptive filtering algorithms consist of recursions on a set of matrix computations, e.g., QR decomposition, Householder triangularization, Givens rotations, matrix multiplications, Householder reflections, and inner and outer vector products. The basic idea is illustrated in Figure 5.

Many wireless communication problems, e.g., equalization, channel estimation, and MIMO channel tracking, can be formulated as stochastic least-squares problems, and therefore can be solved by employing Kalman filtering [3]. A square-root implementation of Kalman filtering is shown in Table I. Adaptive nulling and beamforming [2], CDMA multiuser detection and interference cancellation [4], multiple receive/transmit antenna nulling and cancellation (BLAST), and equalization can be formulated as a deterministic least-mean-squares problem, and solved by the recur-
TABLE II
The QR Recursive Least Squares Algorithm

Initialize: \( \Phi^{1/2}(0) = 0, \; p(0) = 0 \)

For \( n = 1, 2, \ldots \):
\[
\begin{bmatrix} \lambda^{1/2}\Phi^{1/2}(n-1) & u(n) \\ \lambda^{1/2}p^{*}(n-1) & d(n) \\ 0^T & 1 \end{bmatrix}\Theta(n) =
\begin{bmatrix} \Phi^{1/2}(n) & 0 \\ p^{*}(n) & \xi(n)\gamma^{1/2}(n) \\ u^{*}(n)\Phi^{-*/2}(n) & \gamma^{1/2}(n) \end{bmatrix}
\]
\[
w^{*}(n) = p^{*}(n)\Phi^{-1/2}(n)
\]
TABLE III
Square-Root BLAST Algorithm

Initialize: \( P_0^{1/2} = \frac{1}{\sqrt{a}} I_{M \times M}, \; B_0 = 0_{N \times M} \)

For \( i = 1, 2, \ldots \):
\[
\begin{bmatrix} 1 & H_i P_{i-1}^{1/2} \\ 0 & P_{i-1}^{1/2} \\ -e_i & B_{i-1} \end{bmatrix}\Theta_i =
\begin{bmatrix} r_i^{1/2} & 0 \\ K_{p,i} & P_i^{1/2} \\ A_i & B_i \end{bmatrix}
\]
\[
P^{1/2} = P_N^{1/2}, \qquad Q_a = B_N.
\]
\[
P^{1/2}\Sigma = \begin{bmatrix} P^{(M-1)/2} & P_M^{(M-1)/2} \\ 0 & p_M^{1/2} \end{bmatrix}, \qquad Q_a \leftarrow Q_a\Sigma.
\]
sive least squares algorithm (RLS) [3]. One form of the RLS algorithm, based on QR decomposition, is shown in Table II. A square-root implementation [1] of the BLAST algorithm is described in Table III. Singular value decomposition (SVD) is a very useful tool for space processing in MIMO wireless systems. Channel estimation and tracking in multiple receive/transmit antenna systems can be solved by updating the singular value decomposition of the data matrix. An algorithm for updating the SVD of a data matrix [5] is shown in Table IV.
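The adaptive algorithms named above share a common recursive update structure. As a toy illustration of that structure (not part of the paper's design flow; the filter length, step size, and signals are invented for the example), the LMS recursion w(n+1) = w(n) + mu*e(n)*u(n) identifying an unknown 2-tap FIR channel can be sketched as:

```python
# Illustrative LMS adaptive filter: identify an unknown 2-tap FIR system
# from input/desired pairs. All parameter values are invented.
import random

def lms_identify(x, d, num_taps=2, mu=0.05):
    w = [0.0] * num_taps
    for n in range(num_taps - 1, len(x)):
        u = [x[n - k] for k in range(num_taps)]          # regressor u(n)
        y = sum(wi * ui for wi, ui in zip(w, u))         # filter output
        e = d[n] - y                                     # a priori error
        w = [wi + mu * e * ui for wi, ui in zip(w, u)]   # LMS update
    return w

random.seed(0)
w_true = [0.7, -0.3]
x = [random.uniform(-1, 1) for _ in range(2000)]
d = [0.0] * len(x)
for n in range(1, len(x)):
    d[n] = w_true[0] * x[n] + w_true[1] * x[n - 1]

w = lms_identify(x, d)
print([round(wi, 2) for wi in w])
```

With a noiseless desired signal the weights converge to the true channel taps; RLS and its QR-based square-root forms in Tables I-III trade higher per-sample cost for much faster convergence.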
III. Space Programming of Wireless Algorithms
Mapping a communications algorithm to a parallel reconfigurable hardware platform starts with a space representation of the algorithm, describing the actual space-time activities in the form of a Dependence Flow Graph
TABLE IV
The FST Subspace Tracking Algorithm

Householder transformation:
\[
x_s = V_s^{*} x, \qquad z = x - V_s x_s, \qquad v_n = z / \|z\|
\]
QR update (Givens rotations):
\[
\sqrt{\lambda}\,R \rightarrow R, \qquad
Q \begin{bmatrix} R & 0 & x_s^{*} \\ 0 & \sigma_n & \|z\| \end{bmatrix} \rightarrow \begin{bmatrix} R & s \\ 0 & s \end{bmatrix}
\]
QR update:
\[
\begin{bmatrix} R & s \\ 0 & s \end{bmatrix} Q_a \rightarrow \begin{bmatrix} R & 0 \\ s^T & s \end{bmatrix}
\]
QR update:
\[
Q_\beta \begin{bmatrix} R & 0 \\ s^T & s \end{bmatrix} \rightarrow \begin{bmatrix} R & s \\ 0 & \sigma \end{bmatrix}
\]
Matrix-Givens product:
\[
[\,V_s \mid v_n\,]\, Q_a \rightarrow [\,V_s \;\; v_n\,]
\]
[Figure 6 shows an example MATLAB algorithm (initialized with R = triag(n), V = eye(n,n), x = ones(n,1), and containing statements such as w = V'*V*r, R*w = Q*R, and x = R\x) being transformed by the MDL Dependence Graph Generator into a Simulink MDL dependence graph model file.]
Fig. 6. Programming algorithms in space
[8]. Simulink is an excellent environment for "system-level" capture, design, and simulation of wireless communication algorithms, and constitutes the design entry point for the "algorithm to reconfigurable architecture" design flow.

The way we currently program algorithms in space inside the Simulink environment is basically primitive. The designer has to build up the whole application from scratch by choosing primitive hardware blocks (adders, registers, multipliers, etc.) from the Simulink library. This approach is extremely time consuming and inefficient, considering the algorithms we study. We propose a method to automate this design process by providing a dependence graph generator library of matrix computations [10]. Each block inside this library is a masked Simulink block that executes a MATLAB script that generates the Dependence Flow Graph of a matrix computation. The concept of this approach is illustrated in Figure 6, through an example that transforms the MATLAB code of an algorithm into its parallel dependence flow graph.

The MATLAB (.m) dependence graph generators are based on built-in MATLAB commands that allow the user to import blocks into a Simulink model by writing code. Examples of these commands are add_block and add_line [10]. An example of a matrix computation dependence flow graph is shown in Figure 7.
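The generator idea can be sketched in a language-neutral way: instead of emitting Simulink blocks via add_block/add_line, the sketch below builds the same kind of dependence flow graph as a node/edge list for an n x n matrix-vector product. The node naming and graph representation are illustrative, not the paper's.

```python
# Sketch of a dependence-graph generator for y = A*x: one MAC node per
# coefficient, with an accumulation chain along each row (local edges).

def matvec_dfg(n):
    """Build the dependence flow graph of an n x n matrix-vector product."""
    nodes, edges = [], []
    for i in range(n):
        for j in range(n):
            nodes.append(f"MAC_{i}_{j}")          # computes acc + A[i][j]*x[j]
            if j > 0:
                # accumulation chain along row i (local connection)
                edges.append((f"MAC_{i}_{j-1}", f"MAC_{i}_{j}"))
        nodes.append(f"OUT_{i}")
        edges.append((f"MAC_{i}_{n-1}", f"OUT_{i}"))  # row result y[i]
    return nodes, edges

nodes, edges = matvec_dfg(4)
print(len(nodes), len(edges))  # 20 nodes, 16 edges for the 4x4 case
```

A MATLAB generator would emit one add_block call per node and one add_line call per edge of exactly such a graph.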
IV. Massively Parallel Reconfigurable Architecture
A. Architecture Model and Interconnection Network
In this section we derive the proposed architecture model by elaborating on the problem of mapping computational graphs to processor arrays. The problem of mapping matrix computation dependence graphs to array architectures has been addressed by the systolic array community [6]. A design methodology has been developed that transforms a matrix computation to a fixed array structure. The problem of mapping advanced wireless communication algorithms to an array processor boils down to transforming a set of concurrent matrix computations to a reconfigurable array
Fig. 7. Dependence flow graph of \( A^T A \cdot b \)
structure.
The dependence flow graphs of the communication algorithms we study consist of many interconnected dependence graphs of different matrix computations in parallel. To demonstrate our concept, consider the singular value decomposition update algorithm described in Table IV. A Simulink block diagram of this algorithm is shown in Figure 8. The dependence graph of this algorithm consists of a set of matrix computation dependence graphs, such as matrix-vector multiplication, QR update, and Givens rotations. The connections inside the dependence graph of each matrix computation are considered local, while the connections between the dependence graphs of different matrix computations are considered global. An energy-efficient space mapping should target minimizing the global wire transfers, "transforming" the global connections into local ones.
By visualizing the set of concurrent matrix computations in a three-dimensional plane, we arrive at the concept of an overlay of concurrent matrix computations, as demonstrated in Figure 9. We can transform the global connections shown in Figure 8 (corresponding to the dependence flow graph domain) into local connections (corresponding to the hardware processing units domain) by grouping one computational task from each of the computations in the overlay, forming a tightly connected SuperNode processor that consists of the processing units executing the set of computational tasks grouped together. This is illustrated in Figure 9. The global connections between the computational tasks in the original dependence graph have now been transformed into local IntraSuperNode connections between the processing units inside the SuperNode. The local connections inside the dependence graph of each matrix computation in Figure 8 are now transformed into local InterSuperNode connections. We refer to the resulting interconnection network model as a hierarchical overlay of meshes; it is illustrated in Figure 9.
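The grouping step above can be sketched with a toy overlay model (task labels and edge lists are invented for illustration): tasks at the same grid position in each overlaid computation are fused into one SuperNode, so edges between different computations at the same position become IntraSuperNode, while edges within one computation cross SuperNodes.

```python
# Toy overlay model: a task is (computation_id, position); grouping by
# position forms SuperNodes. Classify each dependence edge accordingly.

def classify_edges(edges):
    """edges: list of ((comp, pos), (comp, pos)) dependences."""
    intra, inter = 0, 0
    for (c1, p1), (c2, p2) in edges:
        if p1 == p2:
            intra += 1     # same SuperNode after grouping (was global)
        else:
            inter += 1     # crosses SuperNodes (was a local DFG edge)
    return intra, inter

# Two overlaid computations (0: matrix-vector, 1: QR update) on positions 0..2.
edges = [
    ((0, 0), (0, 1)), ((0, 1), (0, 2)),   # local edges inside computation 0
    ((1, 0), (1, 1)), ((1, 1), (1, 2)),   # local edges inside computation 1
    ((0, 0), (1, 0)), ((0, 2), (1, 2)),   # global edges between computations
]
intra, inter = classify_edges(edges)
print(intra, inter)  # 2 intra-SuperNode, 4 inter-SuperNode
```

The formerly global inter-computation edges become short IntraSuperNode hops, which is exactly the wire-length reduction the overlay targets.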
It may be the case that the set of concurrent matrix computations exceeds four, the number in our example. In such a scenario we employ a more powerful SuperNode to boost the computing power of the array. In order to model the SuperNode architecture we introduce the Processing Unit Bank concept, which is a cluster
[Figure 8 shows the Simulink block diagram of the SVD update: matrix-vector, transpose, QR update, and Givens rotation blocks, with local connections inside each matrix computation and global connections between them.]
Fig. 8. Space mapping problem of SVD
[Figure 9 depicts a SuperNode built from mul, r/cordic, v/cordic, and alu units, formed by overlaying parallel matrix computations (Givens matrix, QR update, matrix multiplication).]
Fig. 9. SuperNode and Overlay of Parallel Computations
of processing elements of the same arithmetic operation. Each SuperNode consists of four processing unit banks: the MUL/ADD processor bank, the R/CORDIC (rotate) and V/CORDIC (vector) processor banks, and the ALU processor bank. For example, a SuperNode of type [2,2,1,1] consists of 2 MUL/ADD, 2 R/CORDIC, 1 V/CORDIC, and 1 ALU units. The final architecture is a heterogeneous array of SuperNodes, connected with an overlay-of-meshes interconnection network, as illustrated in Figure 10. The architecture of the SuperNode is also shown. We refer to the interconnection network inside the SuperNodes as the IntraSuperNode or IntraCluster network, and the interconnection network between the SuperNodes as the InterSuperNode or InterCluster network. In a scenario where the number of parallel computations exceeds the number of available processing elements inside the SuperNode, hardware multiplexing or folding is applied, as explained later.
B. Energy Model
The InterCluster network consists of an array of mesh networks, where each output port of a SuperNode (cluster) routes information along its own mesh network, thus providing interconnection network parallelism, as illustrated in Figure 11. The IntraCluster network is shown in Figure 12 and consists of a mesh network along with a number of multiplexers, one multiplexer for each pro-
[Figure 10 shows a heterogeneous array of SuperNodes of various types (e.g., [2 2 2 2], [2 1 2 1], [3 1 1 1]); each cluster/SuperNode contains Mul, R, V, and Alu units, switches, and a buffer.]
Fig. 10. Heterogeneous array of SuperNodes
cessing unit input port. The multiplexers select the appropriate signal among all possible processing unit output ports from the same cluster and the signal coming from an external cluster (the In input signals).

For example, assume we need to route the following connections in parallel: the output port of the multiplier in Cluster23 has to be routed to the (b) port of the ALU processing unit in Cluster11, and in parallel the (r) port of the R/CORDIC processing unit in Cluster13 has to be routed to the (z) port of the MUL processing unit in Cluster21. The configuration of the InterCluster and IntraCluster switches is shown in Figures 11 and 12, respectively.

The RC model of the InterCluster and IntraCluster interconnect routing is depicted in Figure 13. The capacitance Cd and resistance Ron denote the diffusion capacitance and the resistance of the pass-transistor switch, respectively. The quantities Rlw and Clw denote the resistance and capacitance of the IntraCluster (local) wire, while the quantities Rgw and Cgw denote the resistance and capacitance of the InterCluster (global) wire. Denote by Nghops and Nlhops the number of hops inside the InterCluster and IntraCluster networks, respectively, and by b the datapath width (e.g., 16 bits). Based on Figure 13, the total energy dissipated on the interconnection network is given by
\[
\begin{aligned}
E_{IN} &= N_g E_{global} + N_l E_{local} \\
       &= N_g \bigl[(7C_d + C_{lw}) + N_{ghops}(6C_d + C_{gw}) + N_{lhops}(6C_d + C_{lw}) + 14C_d\bigr] V \cdot V_{swing}\, b \\
       &\quad + N_l \bigl[(14C_d + C_{lw})\, V \cdot V_{swing}\, b\bigr] \qquad (1)
\end{aligned}
\]
The total energy is given by E_TOT = E_PU + E_MEM + E_IN, where the first term is contributed by the processing elements, the second by data buffering, and the third by the interconnection network. An Elmore RC delay model is developed in [10].
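A direct transcription of the interconnect energy model of Eq. 1 can be sketched as follows; all device values (Cd, Clw, Cgw, V, Vswing) and hop counts in the example call are placeholders chosen purely for illustration, not the paper's numbers.

```python
# Sketch of the interconnect energy model of Eq. 1. All parameter values
# in the example are illustrative placeholders.

def interconnect_energy(Ng, Nl, Nghops, Nlhops,
                        Cd, Clw, Cgw, V, Vswing, b):
    """Return E_IN = Ng*E_global + Nl*E_local per Eq. 1 (joules)."""
    e_global = ((7 * Cd + Clw)
                + Nghops * (6 * Cd + Cgw)    # inter-cluster mesh hops
                + Nlhops * (6 * Cd + Clw)    # intra-cluster hops on the path
                + 14 * Cd) * V * Vswing * b
    e_local = (14 * Cd + Clw) * V * Vswing * b
    return Ng * e_global + Nl * e_local

# Example: 3 global transfers (2 inter-cluster and 1 intra-cluster hop each),
# 10 local transfers, 16-bit datapath.
E = interconnect_energy(Ng=3, Nl=10, Nghops=2, Nlhops=1,
                        Cd=2e-15, Clw=20e-15, Cgw=200e-15,
                        V=1.0, Vswing=1.0, b=16)
print(f"{E * 1e12:.2f} pJ")
```

The compiler supplies Ng, Nl, and the hop counts from the mapped graph, which is how the Table V breakdown later in the paper is obtained.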
V. Simulink to Reconfigurable Model Compiler
An overview of the proposed design methodology is shown in Figure 14. The first step of the design flow is illustrated in Figure 6. The dependence flow graph that represents the algorithm in parallel, as well as the architecture model, constitute the input to the compiler. The output of the compile process is an executable reconfigured architecture Simulink model. Each computational node in the
Fig. 11. Overlay of Meshes InterCluster Network
Fig. 12. [1111] IntraCluster Network
dependence graph is modeled as a C-MEX (MATLAB-executable) S-function that is linked to our fixed-point matrix computation library [10]. Each functional block inside that library is a masked S-function. The mask parameters specify the hardware characteristics of the processing nodes, such as fixed-point data type, number of pipeline stages, number of CORDIC iterations, and others.
In order to process the dependence graph of an algorithm represented as a Simulink model, we need to transform the Simulink description (.mdl file) into a C++ data structure. Our compiler transforms the .mdl file into a C++ directed graph. The program is written in C++ and consists of a lexical analyzer, mdllexer.l (using Flex), and a parser, mdlparser.y (using Bison). The output of the parsing process is a C++ graph that represents the dependence flow graph of the algorithm. A Simulink .mdl file is a hierarchical, tree-structured ASCII file that contains keywords and parameter-value pairs that describe the model. The mdl2dfg program builds this hierarchical tree as a C++ list of linked lists after the parsing process. This tree is walked such that all the dependency edges, and the delays associated with each edge, are extracted.
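The keyword/parameter-value tree structure described above can be sketched with a toy recursive-descent parser (a simplification: real .mdl files also have quoted strings and array syntax that this does not handle, and the block names below are invented):

```python
# Toy parser for an mdl-style nested "Keyword { param value ... }" format,
# returning a (name, params, children) tree like mdl2dfg's linked-list tree.

def parse_mdl(tokens):
    name = tokens.pop(0)
    assert tokens.pop(0) == "{"
    params, children = {}, []
    while tokens[0] != "}":
        if len(tokens) > 1 and tokens[1] == "{":
            children.append(parse_mdl(tokens))   # nested block
        else:
            key, value = tokens.pop(0), tokens.pop(0)
            params[key] = value                  # parameter-value pair
    tokens.pop(0)  # consume the closing "}"
    return name, params, children

text = "Model { Name svd System { Block { BlockType Gain } Block { BlockType Sum } } }"
tree = parse_mdl(text.split())
name, params, children = tree
print(name, params["Name"], len(children[0][2]))  # Model svd 2
```

Walking such a tree to collect Block/Line entries (and the delay parameter on each line) yields exactly the edge list the compiler needs.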
The computational nodes of the graph are subsequently
Fig. 13. RC equivalent to Inter and Intra Cluster routing
[Figure 14 outlines the design flow: a wireless algorithm in MATLAB enters the dependence graph generator library; the mdl2dfg Flex/Bison parser turns the Simulink .mdl file into a C++ directed graph; scheduling/retiming, mapping (minimizing an energy cost function), folding equations for hardware multiplexing, and manual partitioning/task assignment follow; finally the Reconfigure.m MATLAB script reads configuration data (switches, buffers, processor scheduling) and produces the reconfigured architecture in Simulink, whose masked C-MEX S-function dialog parameters carry the hardware settings (CORDIC rotations, pipeline stages, fixed-point data type, assigned cluster and processing unit).]
Fig. 14. Design flow methodology
scheduled by an as-soon-as-possible (ASAP) scheduling routine described in Section V-A. The dependence graph has to be partitioned to fit the target array architecture, and each computational task in the graph is assigned to a processing unit inside a SuperNode. This mapping process is described in Section V-B.
At this point of the design flow, each object node in the C++ graph has three properties associated with it: the SuperNode that the computational task is assigned to, the processing unit inside that SuperNode that executes the task, and the time step at which the operation takes place (the schedule). The three object fields of each node are input to the hardware multiplexing/folding transformation block, which derives configurations for the switchboxes in the array-of-meshes network and programs the memory buffers of each of the processing units. Finally, the Reconfigure.m MATLAB script reads the configuration data from a file and automatically generates the reconfigured architecture model as an executable Simulink .mdl file. The initial algorithm written in MATLAB and the executable reconfigured architecture should have the same behavioral functionality.
A. Scheduling
The scheduling algorithm employed in our design flow is based on the "as soon as possible" (ASAP) scheduling technique [8]. The original ASAP scheduling assumes unlimited resources; we modify the ASAP scheduler to take the limited number of resources into consideration. Denote the time step at which a particular node n is assigned to be executed on the processing element PEn as time step TSn, and assume the computation takes Cn time steps. Our scheduling algorithm is as follows:
Input: DG = (N, E)
Output: scheduled graph

∀ node n ∈ N : Predecessors_n = ∅ ⇒ TS_n = 0
while (number of unscheduled nodes > 0)
    ∀ node n_j : if (all Predecessors_{n_j} are scheduled)
        TS_j = max(TS_i + C_i) over all predecessors i
        while (∃ node n_k : TS_k = TS_j and PE_j = PE_k)
            TS_j = TS_j + 1
        decrement the number of unscheduled nodes
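The resource-constrained ASAP routine above can be sketched directly in Python (task names, latencies, and processing-element labels are invented for the example; the graph is assumed acyclic):

```python
# Sketch of the resource-constrained ASAP scheduler: a node's start time is
# the ASAP bound after its predecessors, pushed later while its assigned
# processing element is busy at the candidate time step.

def asap_schedule(nodes, edges):
    """nodes: {name: (pe, latency)}; edges: (pred, succ) pairs (acyclic)."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    ts = {}
    while len(ts) < len(nodes):
        for n, (pe, _lat) in nodes.items():
            if n in ts or not all(p in ts for p in preds[n]):
                continue
            # ASAP bound: earliest step after all predecessors complete
            t = max((ts[p] + nodes[p][1] for p in preds[n]), default=0)
            busy = {ts[m] for m in ts if nodes[m][0] == pe}
            while t in busy:        # limited resources: push back on conflict
                t += 1
            ts[n] = t
    return ts

# Two MUL tasks contend for the same unit; the ALU task waits for both.
nodes = {"m1": ("MUL", 1), "m2": ("MUL", 1), "a1": ("ALU", 1)}
edges = [("m1", "a1"), ("m2", "a1")]
schedule = asap_schedule(nodes, edges)
print(schedule)
```

With one MUL unit, m2 is deferred by one step, and the dependent ALU task starts only after the later multiplier finishes, exactly the conflict-resolution loop of the pseudocode.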
B. Mapping
The problem of assigning the tasks of a larger computation graph to a parallel computer of smaller size is referred to as algorithm partitioning [6]. This problem is also known in the systolic design community as space-time mapping of computations onto a fixed-size array processor.

The processor allocation problem involves partitioning the dependence graph into many blocks, each consisting of a cluster of nodes in the dependence graph. The two best-known methods of partitioning a dependence graph and assigning computational tasks to processing elements are the locally sequential globally parallel (LSGP) method and the locally parallel globally sequential (LPGS) method [6]. As demonstrated in [10], the LPGS scheme, which is better tailored to matrix computations, suffers from extensive global connectivity (energy dissipation) each time the intermediate block results are loaded into the array for processing the next block. The interleaved LPGS space mapping is proposed in [10]; it requires only local connections between the processing elements, at the expense of reconfiguring the interconnection network from one processed block to the next, as illustrated in Figure 15.

These mapping techniques originate from the systolic array community, which tries to map a matrix computation to a homogeneous fixed array processor. Mapping wireless communications to reconfigurable architectures requires the transformation of a set of concurrent matrix computations to an array of heterogeneous SuperNodes. We propose two ways to solve this problem. The first technique applies the interleaved LPGS partitioning technique to each of the matrix computations involved, which maps the computation onto the SuperNode array as well as onto the processing unit responsible for executing the particular type of matrix computation (e.g., R/CORDICs for Givens rotations). This process is repeated for all the computations, until all
[Figure 15 illustrates the proposed interleaved cut-and-pile partitioning: a 3x3 block of processors P0-P8 is reused over time steps t = 0..3, permuting rows and/or columns and reconfiguring connections to change the processing direction along the i and j axes.]
Fig. 15. Proposed Interleaved LPGS
the processing units in a SuperNode have been assigned a task to execute. Subsequently, hardware multiplexing/folding is applied, as explained in the next subsection. The second technique maps the entire dependence flow graph of the algorithm to the SuperNode array and to the processing units by minimizing a cost function. Since our cost metric is energy efficiency, this mapping method tries to minimize the interconnection equivalent distance, defined as ID = α · N_inter + β · N_intra, where N_inter and N_intra are the total numbers of hops in the InterSuperNode and IntraSuperNode mesh networks, and α, β are weight factors.
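The cost-driven mapping can be sketched as a brute-force search over task-to-SuperNode placements that minimizes ID (with several assumptions not in the paper: hop counts are taken as Manhattan distances on the SuperNode grid, and the weights, task names, and grid size are invented):

```python
# Sketch of the ID-minimizing mapping: enumerate candidate task placements
# and keep the one minimizing ID = alpha*N_inter + beta*N_intra.
from itertools import product

def interconnect_distance(mapping, edges, alpha=4.0, beta=1.0):
    n_inter = n_intra = 0
    for u, v in edges:
        (xu, yu), (xv, yv) = mapping[u], mapping[v]
        hops = abs(xu - xv) + abs(yu - yv)   # Manhattan hops on the mesh
        if hops == 0:
            n_intra += 1                     # same SuperNode: intra hop
        else:
            n_inter += hops                  # inter-SuperNode mesh hops
    return alpha * n_inter + beta * n_intra

def best_mapping(tasks, edges, grid):
    best = None
    for placement in product(grid, repeat=len(tasks)):
        m = dict(zip(tasks, placement))
        cost = interconnect_distance(m, edges)
        if best is None or cost < best[1]:
            best = (m, cost)
    return best

tasks = ["matvec", "qr", "givens"]
edges = [("matvec", "qr"), ("qr", "givens")]
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
mapping, cost = best_mapping(tasks, edges, grid)
print(cost)  # tightly coupled tasks end up co-located
```

Because α exceeds β, the search co-locates communicating tasks, mirroring the paper's goal of turning expensive global transfers into cheap intra-SuperNode ones; a practical compiler would use a heuristic rather than exhaustive enumeration.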
C. Folding Transformation
Systematic design of hardware multiplexing is achieved through the folding transformation [8]. This transformation derives control circuits for DSP architectures in which multiple arithmetic operations are assigned to be executed on a single functional unit. Folding is used to fold the fully parallel space representation of an algorithm (the DG) into the reconfigurable fixed-size parallel processor. The input to the folding transformation is the scheduled and space-mapped dependence graph of the algorithm. The output of the folding transformation re-schedules (in case we apply retiming) the processing elements in the SuperNode, configures the switches in the interconnection network, and programs the memory buffers of each of the processing elements.

After the folding transformation, the original dependence graph (a graph of computational tasks) is transformed into a folded graph that represents the actual processing units in the architecture, rather than the computational tasks as in the original graph. The vertices of the folded graph are the available processing elements inside the clusters, and the edges represent the number of delays for which the result from the source processing element must be stored before it is used by the destination processing element.

For a folded system to be realizable, the folded data path delays [8] must be nonnegative. In case the folding transformation is not realizable, we apply retiming to change the delays in the original graph. This retiming problem can be reformulated as a shortest-path problem and solved by employing the Bellman-Ford algorithm. If a solution
TABLE V
Energy breakdown for the biquad filter
Architecture Module        Energy (pJ)
Interconnection Network    44
Processing Units           100
Memory                     84
Total                      228
exists, then the retiming value r(X) is the shortest path from an arbitrary node O to node X.
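The retiming step above can be sketched with a standard Bellman-Ford solver over the constraint graph: each constraint r(V) − r(U) ≤ w becomes an edge (U, V, w), a source O with zero-weight edges to all nodes is added, and r(X) is the shortest path from O to X. The graph data below is illustrative, not taken from the paper.

```python
# Minimal Bellman-Ford sketch for the retiming step. Edge (u, v, w) encodes
# the constraint r(v) - r(u) <= w; the graph here is a made-up example.

def bellman_ford(nodes, edges, source):
    """Shortest-path distances from source, or None if a negative cycle
    exists (i.e. the retiming constraints are infeasible)."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):          # relax all edges |V|-1 times
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:                    # one extra pass detects negative cycles
        if dist[u] + w < dist[v]:
            return None
    return dist

nodes = ["O", "A", "B", "C"]
# Source O reaches every node with weight 0, as in the constraint-graph method.
edges = [("O", "A", 0), ("O", "B", 0), ("O", "C", 0),
         ("A", "B", 1), ("B", "C", -1), ("A", "C", 2)]
r = bellman_ford(nodes, edges, "O")
print(r)  # {'O': 0, 'A': 0, 'B': 0, 'C': -1}
```

If `bellman_ford` returns `None`, no retiming satisfies the folding constraints and the schedule must be revised.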
VI. Design Examples
We provide two design examples in which the algorithms are first described in Simulink and our compiler maps them onto the reconfigured architecture. The first example is a biquad filter, and the second is the SVD update algorithm. The energy efficiency of our architecture is compared with that of one of the state-of-the-art Texas Instruments DSPs.
A. Biquad Filter
The biquad filter (Figure 6.3 in [8]) is manually mapped to the processing elements in the architecture, which consists of one Cluster. The filter is mapped to two processing elements: one MUL unit and one ALU unit. The MUL hardware module has two pipeline stages and the ALU module one pipeline stage. The Simulink model of the filter is parsed, and the algorithm's graph is scheduled and retimed so that it can be folded and executed by the two processing modules.
Our compiler also computes the number of interconnections that need to be routed between different processing elements, the number of times each processing unit is scheduled during one iteration of the algorithm, and the number of memory buffer operations for one execution of the algorithm. Substituting these numbers into the energy dissipation model in Eq. 1, we obtain the energy estimation breakdown shown in Table V.
The energy efficiency of the proposed architecture while executing the biquad filter is Energyeff = OP/Energy = 8 OP / 228 pJ ≈ 35 MOPS/mW.
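As a quick sanity check on this figure, the few lines below reproduce it from the Table V breakdown and the operation count quoted in the text (the unit conversion is the only added step).

```python
# Sanity check on the biquad numbers: the Table V breakdown and the
# 8 operations per filter iteration quoted in the text give ~35 MOPS/mW.

energy_pj = {"interconnect": 44, "processing": 100, "memory": 84}
total_pj = sum(energy_pj.values())       # 228 pJ per filter iteration
ops = 8                                  # operations per biquad iteration

# 1 OP/pJ = 10^12 OPS/W = 10^3 MOPS/mW
eff_mops_per_mw = ops / total_pj * 1e3
print(total_pj, round(eff_mops_per_mw))  # 228 35
```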
The power consumption of the Texas Instruments TMS320C64xx processor family is now discussed. The following table shows the power consumption of the TMS320C641 processor at 50% high / 50% low activity.
Processor        Baseline (W)   Activity (W)   Total (W)
600 MHz/1.4V     0.25           0.17           0.42
500 MHz/1.2V     0.15           0.11           0.26
The biquad filter was described in C using the Texas Instruments Code Composer Studio, and the code was compiled and run on the TMS320C641 processor. The number of clock cycles to execute the biquad filter was found to be 72. Therefore, the energy efficiency of the TMS320C641
Fig. 16. SVD update Simulink representation
TABLE VI
Energy breakdown for the SVD algorithm
Architecture Module    Energy (pJ)
Interconnection        1660
Processing Units       1580
Memory                 875
Total                  4115
processor while executing the biquad filter is Energyeff = OP/Energy = 8 OP / (0.26 W · 72/(500·10^6) s) ≈ 0.25 MOPS/mW.
B. SVD Update
The Simulink representation of the SVD update algorithm is shown in Figure 16. The Simulink model is compiled to a dependence graph, which is scheduled, mapped, and folded onto a 4-by-4 array cluster. An estimate of the energy dissipation contributed by each module is shown in Table VI. The histograms of the interconnection routing requirements are shown in Figure 17.

The worst-case delay across the interconnection network was found to be delay ≈ 9.15·10^-8 sec, which means that the architecture can be clocked as fast as 10 MHz. The power consumption is P = 4115 pJ / (N·delay) ≈ 20 mW, where N is the folding factor, which was 2 in this example. The 4×4 array cluster is capable of delivering

((A·A^T·b)op + (a−b)op + 2·QRop + (V·Q)op) / (N·delay) = 72 / (18.3·10^-8) ≈ 400 MOPS

while executing the SVD algorithm. The energy efficiency
[Figure 17 panels: histograms of routing distances, number of occurrences vs. InterCluster number of hops and IntraCluster number of hops]
Fig. 17. Distribution of distance travelled across interconnection network
TABLE VII
Energy Efficiency Comparison
Energy Efficiency (MOPS/mW)

Architecture                         Biquad Filter   SVD
TMS320C64xx (0.15µ, 1.2V)            0.25            0.013
Proposed Architecture (0.25µ, 1V)    40              17
Improved Energy Efficiency Factor    160             1300

Pleiades [7] (0.25µ, 1V): 10-80 MOPS/mW (VCELP algorithm)
of our architecture for the SVD algorithm is Energyeff = OP/Energy = 72 OP / 4115 pJ ≈ 17 MOPS/mW. The SVD update algorithm was written in C using the Code Composer Studio in order to simulate the TMS320C641X DSP processor. After compiling and profiling, the total number of clock cycles for the SVD algorithm was found to be N = 11,311. The TMS320C64XX DSP is thus able to deliver 72 / (11311/(500·10^6)) ≈ 3.2 MOPS when executing the SVD algorithm. Its average power consumption is 0.26 W. Therefore its energy efficiency is Energyeff = 3.2 MOPS / 0.26 W ≈ 0.013 MOPS/mW. A comparison between the energy efficiency of the reconfigurable architecture and the TMS320C641 processor is shown in Table VII. We conclude that there is a two to three orders of magnitude difference in energy efficiency between the DSP and the reconfigurable architecture.
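The comparison can be re-derived in a few lines of arithmetic from the counts quoted in the text (72 operations per update, 4115 pJ from Table VI, 11,311 cycles at 500 MHz, and the 0.26 W total from the 500 MHz/1.2 V row of the power table):

```python
# Re-deriving the Table VII comparison from the counts quoted in the text.

# Proposed architecture: 72 operations per SVD update, 4115 pJ (Table VI).
arch_eff = 72 / 4115 * 1e3                 # OP/pJ converted to MOPS/mW

# TMS320C641: 11,311 cycles at 500 MHz, 0.26 W total power.
dsp_mops = 72 / (11311 / 500e6) / 1e6      # delivered throughput in MOPS
dsp_eff = dsp_mops / (0.26 * 1e3)          # MOPS per mW

print(round(arch_eff, 1), round(dsp_mops, 1), round(arch_eff / dsp_eff))
```

The exact ratio lands near the ~1300× improvement factor of Table VII once the quoted roundings (17 MOPS/mW and 0.013 MOPS/mW) are applied.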
VII. Conclusions
We proposed a massively parallel reconfigurable architecture and a compiler that transforms advanced wireless communication algorithms expressed in Simulink into an executable Simulink model representing the reconfigured architecture. We provided examples that demonstrate a two to three orders of magnitude improvement in energy efficiency of the proposed architecture over a state-of-the-art DSP processor.
References
[1] B. Hassibi, "An Efficient Square-Root Algorithm for BLAST", IEEE Conf. on Acoustics, Speech and Signal Processing, 2000, vol. 2, pp. 737-740.
[2] C. M. Rader, "VLSI Systolic Arrays for Adaptive Nulling", IEEE Signal Processing Magazine, July 1996, pp. 29-49.
[3] S. Haykin, Adaptive Filter Theory, Prentice Hall, 1996.
[4] X. Wang, H. V. Poor, "Blind Multiuser Detection: A Subspace Approach", IEEE Trans. on Information Theory, vol. 44, no. 2, March 1998.
[5] D. J. Rabideau, "Fast, Rank Adaptive Subspace Tracking and Applications", IEEE Trans. on Signal Processing, vol. 44, no. 9, September 1996.
[6] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[7] J. Rabaey, "Silicon Platforms for the Next Generation Wireless Systems - What Role Does Reconfigurable Hardware Play?", Proceedings FPL 2000, Austria, August 2000.
[8] K. K. Parhi, VLSI Digital Signal Processing Systems, John Wiley, 1999.
[9] K. Sarrigeorgidis, J. Rabaey, "Ultra Low Power CORDIC Processor for Advanced Wireless Communications Algorithms", submitted to VLSI Signal Processing Systems.
[10] K. Sarrigeorgidis, "Massively Parallel Reconfigurable Wireless Processor Architecture and Programming", Ph.D. Thesis, U.C. Berkeley, Dec. 2002.