DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,

DSP Architectural Considerations for Optimal Baseband Processing

Sridhar Rajagopal

Scott Rixner

Joseph R. Cavallaro

Behnaam Aazhang

Rice University, Houston, TX

This work has been supported by Nokia, TI, TATP and NSF

Wireless Communication Systems

Flexibility is required

Mobile
– Switch between standards
– Switch between parameters

Base-station
– Varying number of users
– Each user has different parameters

[Figure: wireless mobile device block diagram with an RF unit, A/D and D/A converters, and a programmable baseband communications processor]

Integration of Cellular/Wireless LAN

W-CDMA base-station
– 4Mbps
– Delay constraints
– Area constraints?

W-LAN base-station
– 100Mbps
– Delay constraints
– Area constraints?

Mobile
– W-CDMA & W-LAN
– 1Mbps & 100Mbps/# of users
– Delay, area, and power constraints!

Computation Requirements

Estimation, Detection and Decoding in a 4Mbps W-CDMA cellular multiuser system

[Figure: ALUs required for real-time at 500 MHz (log scale, 10^0 to 10^3) versus number of W-CDMA cellular users (0 to 300), with separate curves for adds and multiplies under slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits), and fast fading (estimation every 10 bits); annotation: data rates per user]

Proposed DSP System Evolution

Current solutions to meet real-time: racks of DSPs

Future wireless DSP architectures: a programmable DSP processor for 4G wireless systems, < x cm by < x cm
– x = 2.5 (W-CDMA BS)
– x = 2.0 (W-LAN BS)
– x = 1.5 (Mobile Handset)

The System Design Challenge

Current single processor DSPs not powerful enough for next generation multi-standard applications

Algorithms well understood at data-flow level

Can design real-time systems in fixed VLSI

Pushing towards programmable implementation

Stream processors provide an interesting alternative

Research Contributions

Algorithms for future wireless communications
– Multiuser channel estimation and detection
– Task partitioning, parallelism, pipelining
– Used DSPs to develop and understand algorithms

Special-purpose implementations
– VLSI and FPGA mappings of algorithms
– Conventional and on-line arithmetic

Flexible implementations (current work)
– Future DSP architectures?
– Stream processors?
– Architectural innovation
– Functional unit design and usage

Outline

Motivation

Parallel Algorithms for Estimation, Detection, and Decoding

Stream Processor Architecture

Performance Comparisons and Results

Typical Base-station Algorithms

Wireless LAN
– Equalization?
– FFT
– Viterbi decoding

W-CDMA
– Multiuser channel estimation
– Multiuser detection
– Viterbi decoding

Advanced receiver schemes
– Turbo decoding
– Multiple antenna systems

Parallel W-CDMA Estimation/Detection/Decoding

Multiuser estimation
– Replaced matrix inversion by gradient descent

Multiuser detection
– Parallel Interference Cancellation (PIC)
– Pipelined algorithm that avoids block-based detection

Viterbi decoding
– Trellis structures suited for decoding
– Register exchange for survivor memory
– No traceback latency
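The first bullet can be illustrated with a small sketch. It solves R_bb A = R_br by repeated gradient steps of the form A ← A − μ(R_bb A − R_br) instead of forming an inverse. The matrix values, the step size `mu`, the iteration count, and the restriction to real-valued data are all illustrative assumptions, not values from the slides.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gradient_descent_solve(r_bb, r_br, mu=0.1, iterations=200):
    """Iterate A <- A - mu * (R_bb A - R_br), starting from A = 0."""
    rows, cols = len(r_br), len(r_br[0])
    a = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        product = matmul(r_bb, a)  # computed from A before it is updated
        for i in range(rows):
            for j in range(cols):
                a[i][j] -= mu * (product[i][j] - r_br[i][j])
    return a


# Hypothetical 2x2 autocorrelation matrix and 2x1 cross-correlation.
R_BB = [[4.0, 1.0], [1.0, 3.0]]
R_BR = [[1.0], [2.0]]
A_EST = gradient_descent_solve(R_BB, R_BR)
print(A_EST)  # close to the exact solution [[1/11], [7/11]]
```

Each iteration needs only multiplies and adds, which is the kind of work the adder and multiplier units discussed later can absorb; the step size must be small relative to the largest eigenvalue of R_bb for the iteration to converge.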

Estimation/Detection (64,32 sizes)

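Multiuser Estimation

$R_{bb} = R_{bb} - b_0 b_0^T + b_L b_L^T$

$R_{br} = R_{br} - b_0 r_0^H + b_L r_L^H$

$A = A - \mu (R_{bb} A - R_{br})$

Prepare Matrices for Detection

$L = \mathrm{Re}[A_1^H A_0]$

$C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]$

$R = L^H$

Multiuser Detection

$y_i = \mathrm{Re}[A_0^H x_i + A_1^H x_{i-1}]$

$d_i = \mathrm{sign}(y_i)$

$y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}$

$d_i = \mathrm{sign}(y_i)$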
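A minimal sketch of one parallel interference cancellation stage, assuming the detection update has the form y_i − L d_{i−1} − C d_i − R d_{i+1} followed by a sign decision, where the d terms are the previous stage's bit estimates. The two-user matrices and matched-filter outputs below are hypothetical, real-valued, and synchronous (L and R are zero); none of these values come from the slides.

```python
def sign(v):
    """Hard decision; zero is mapped to +1 by assumption."""
    return 1 if v >= 0 else -1


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[c] * v[c] for c in range(len(v))) for row in m]


def pic_stage(y0, bits, left, centre, right):
    """One PIC stage: every bit time uses only previous-stage estimates."""
    zeros = [0] * len(y0[0])
    new_bits = []
    for i in range(len(y0)):
        prev_b = bits[i - 1] if i > 0 else zeros
        next_b = bits[i + 1] if i + 1 < len(bits) else zeros
        interference = [l + c + r for l, c, r in zip(
            mat_vec(left, prev_b), mat_vec(centre, bits[i]),
            mat_vec(right, next_b))]
        new_bits.append([sign(y - t) for y, t in zip(y0[i], interference)])
    return new_bits


# Hypothetical example: user 2 is much stronger than user 1.
ZERO = [[0.0, 0.0], [0.0, 0.0]]
C = [[0.0, 1.5], [1.5, 0.0]]             # cross-correlations, zero diagonal
TRUE_BITS = [[1, -1], [-1, -1]]
Y0 = [[-0.5, -2.5], [-2.5, -5.5]]        # matched-filter outputs for TRUE_BITS
initial = [[sign(v) for v in y] for y in Y0]
refined = pic_stage(Y0, initial, ZERO, C, ZERO)
print(initial, refined)
```

In this example the matched-filter decision for user 1 at the first bit time is wrong because of the strong interferer, and a single cancellation stage corrects it. Because each stage reads only the previous stage's estimates, all users and bit times can be updated in parallel.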

Viterbi Trellis for Rate ½ Code with K = 5

[Figure: three trellis diagrams over the 16 states X(0) to X(15): a. Unsuitable Trellis, b. Suitable Trellis, c. Shuffled Suitable Trellis. One column of the shuffled trellis lists the even-indexed states X(0), X(2), ..., X(14) followed by the odd-indexed states X(1), X(3), ..., X(15)]

Survivor Management in Viterbi Decoding

Two techniques
– Traceback (commonly used)
– Register exchange

Traceback is simpler
– Less area in VLSI architectures
– Drawback: Sequential and additional latency

Register exchange is faster
– Parallel updates
– Packing decoded bits in the register needs to access the entire register
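The register-exchange idea can be sketched as follows. Every state keeps a register holding the decoded bits of its survivor path; at each step the register of the winning predecessor is copied and extended, so the decoded sequence is available immediately at the end with no traceback pass. To keep the example short it assumes a rate ½ code with K = 3 and generators 7 and 5 (octal) with hard decisions, rather than the K = 5 code on the slides; the function names are illustrative.

```python
def parity(v):
    """Return the XOR of the bits of v."""
    return bin(v).count("1") & 1


def encode(bits, k=3, gens=(0b111, 0b101)):
    """Convolutional encoder; the newest input bit sits in the LSB."""
    state, out = 0, []
    mask = (1 << (k - 1)) - 1
    for u in bits:
        reg = (state << 1) | u
        out.append(tuple(parity(reg & g) for g in gens))
        state = reg & mask
    return out


def viterbi_register_exchange(received, k=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoding with register-exchange survivors."""
    n_states = 1 << (k - 1)
    mask = n_states - 1
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)  # encoder starts in state 0
    registers = [[] for _ in range(n_states)]
    for symbol in received:
        new_metrics = [inf] * n_states
        new_registers = [[] for _ in range(n_states)]
        for state in range(n_states):
            if metrics[state] == inf:
                continue
            for u in (0, 1):
                reg = (state << 1) | u
                nxt = reg & mask
                expected = tuple(parity(reg & g) for g in gens)
                cost = metrics[state] + sum(
                    a != b for a, b in zip(expected, symbol))
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    # Exchange: copy the survivor's whole register, append u.
                    new_registers[nxt] = registers[state] + [u]
        metrics, registers = new_metrics, new_registers
    best = min(range(n_states), key=lambda s: metrics[s])
    return registers[best]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0]  # last two zeros flush the encoder
coded = encode(message)
noisy = list(coded)
noisy[2] = (1 - noisy[2][0], noisy[2][1])  # inject one channel bit error
decoded = viterbi_register_exchange(noisy)
print(decoded == message)
```

The copy of the whole register on every update is the cost the last bullet refers to: each state update touches the entire survivor register, in exchange for removing the sequential traceback.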

DSP Evolution and Trends

DSP Architectures
– Increased parallelism and computational throughput
– TI TMS320C6x generation of VLIW DSPs

Media Processing Architectures
– Orders of magnitude increase in parallelism and computational throughput (3D graphics!)
– Imagine processor developed at MIT/Stanford
  • Prototype fabricated and licensed by TI
  • Flexible and extensible VLIW multiple cluster architecture
– Applicable to wireless communications?

The Imagine Architecture

[Figure: Imagine stream processor block diagram with a host processor, stream controller, microcontroller, network interface, stream register file, ALU clusters 0 to 7, and a streaming memory system connected to four SDRAM banks]

Arithmetic Clusters

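VLIW control

3 adders, 2 multipliers, 1 divider

Scratch-pad and communication unit

Distributed register files

[Figure: arithmetic cluster datapath with three adders (+), two multipliers (*), and one divider (/), each with a local register file, plus a communication unit (CU) on the intercluster network, all joined by a cross point with connections from and to the SRF]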

Bandwidth Hierarchy

41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy. Four SDRAM banks feed the stream register file at 2GB/s, the stream register file feeds the ALU clusters at 32GB/s, and bandwidth within the ALU clusters is 544GB/s]
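As a rough worked example of what these figures imply, the sketch below computes the ratio between adjacent levels of the hierarchy and the compute rate that corresponds to 41.2 operations per memory word. It assumes 4-byte (32-bit) words and treats the three bandwidths as the SDRAM, stream register file, and cluster-level figures in that order; the variable names are illustrative.

```python
BYTES_PER_WORD = 4  # assumption: 32-bit words

memory_gbs = 2.0      # SDRAM bandwidth
srf_gbs = 32.0        # stream register file bandwidth
cluster_gbs = 544.0   # bandwidth within the ALU clusters
ops_per_memory_word = 41.2

srf_over_memory = srf_gbs / memory_gbs        # 16x step up
cluster_over_srf = cluster_gbs / srf_gbs      # 17x step up
memory_gwords = memory_gbs / BYTES_PER_WORD   # 0.5 Gwords/s
implied_gops = ops_per_memory_word * memory_gwords

print(srf_over_memory, cluster_over_srf, implied_gops)
```

Under these assumptions a kernel has to perform on the order of 20 billion operations per second, reusing data held in the register files, to keep the arithmetic units busy at full memory bandwidth.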

Stream Programming

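StreamC
– Executes on host processor
– C++
– Controls stream transfers between main memory and SRF

void main() {
    // Declare streams by element type and length
    Stream<int> a(256);
    Stream<int> b(256);
    Stream<int> c(256);
    Stream<int> d(1024);
    ...
    // Invoke kernels; c is produced by example1 and consumed by example2
    example1(a, b, c);
    example2(c, d);
    ...
}

KernelC
– Executes on clusters
– C-like Syntax
– Kernel computation
– Compiled by iscd

KERNEL example1(istream<int> a,
                istream<int> b,
                ostream<int> c) {
    // Loop once per element of input stream a
    loop_stream(a) {
        int ai, bi, ci;
        a >> ai;                 // read the next element of a
        b >> bi;                 // read the next element of b
        ci = ai * 2 + bi * 3;
        c << ci;                 // append the result to output stream c
    }
}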
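For readers without the Imagine toolchain, the following plain Python model shows what the `example1` kernel computes. The second function processes the streams in batches of eight elements to mimic the clusters each handling one element per loop iteration; that batching, and the sample input data, are assumptions made for illustration and not a description of the compiler's actual schedule.

```python
def example1(a, b):
    """Reference model of the kernel body: c[i] = a[i] * 2 + b[i] * 3."""
    return [ai * 2 + bi * 3 for ai, bi in zip(a, b)]


def example1_clustered(a, b, clusters=8):
    """Same result, computed `clusters` elements at a time."""
    c = []
    for base in range(0, len(a), clusters):
        # One pass of this loop models all clusters working in lockstep.
        batch_a = a[base:base + clusters]
        batch_b = b[base:base + clusters]
        c.extend(ai * 2 + bi * 3 for ai, bi in zip(batch_a, batch_b))
    return c


# Hypothetical input streams of 256 elements each.
stream_a = list(range(256))
stream_b = [1] * 256
stream_c = example1(stream_a, stream_b)
print(stream_c[:4])
```

With 256-element streams and eight clusters, the batched version runs its loop 32 times, which is the sense in which data parallelism across stream elements maps onto the cluster array.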

1024-point FFT Performance

Processor   Frequency   Data             Radix   Time     Power    Energy
Imagine     500MHz      Float (32-bit)   2       7.4µs    3.8W     28µJ
C6711       150MHz      Float (32-bit)   2       138µs    ~1.3W
C6411       300MHz      Fixed (16-bit)   mixed   21µs     ~0.5W
C6201       100MHz      Fixed (16-bit)   2       227µs    0.6W     144µJ
Virtex II   125MHz      Fixed (16-bit)   4       2µs
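The Energy column follows from the Time and Power columns, since energy is time multiplied by power. The sketch below redoes that arithmetic, treating the listed times as microseconds so that the result is in microjoules; the helper name is illustrative.

```python
def energy_uj(time_us, power_w):
    """Microseconds times watts gives microjoules."""
    return time_us * power_w


imagine_uj = energy_uj(7.4, 3.8)    # about 28.1
c6201_uj = energy_uj(227.0, 0.6)    # about 136.2
c6711_vs_imagine = 138.0 / 7.4      # time ratio for the two 32-bit float runs

print(round(imagine_uj, 1), round(c6201_uj, 1), round(c6711_vs_imagine, 1))
```

The Imagine product matches the listed 28 to within rounding. The C6201 product comes out near 136 rather than the listed 144, so the 0.6W entry is presumably a rounded value.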

Media vs. Communications

Similarities
– Data parallelism
– Low-precision data
– High computation rates

Different characteristics of communications processing
– More data reorganization, such as matrix transposes
– Bit-level operations

Explore space of stream processor architectures with isim
– Cycle-accurate stream processor simulation
– Flexible machine description language (read by both simulator and compiler)
– Vary number and design of functional units
– Vary memory, register sizes
– etc.

Stream Data Flow

[Figure: stream data flow through three stages, Multiuser Channel Estimation, Multiuser Detection, and Decoding. Computation blocks: correlation update kernel, matrix mult kernel, iteration update kernel, matrix mul C kernel, matrix mul L kernel, matched filter kernel, PIC kernel, Viterbi kernel. Communication blocks: matrix transpose, data rearrangement, buffer. Streams labelled estimation bits and detection bits]

Matrix Multiplication Kernel (Imagine)

32 cycle loop

Executed on all 8 clusters

Complexity
– O(N^3) multiplies
– O(N^3) adds

100% multiplier utilization in the loop

Divider is unnecessary!

[Figure: inner-loop instruction schedule for functional units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0; shading marks Communication (waiting for input) and FU unavailable (input ready but FU busy)]
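The operation counts can be checked with a small model. The sketch below multiplies two matrices while counting multiplies and adds, and assigns output rows to eight clusters in round-robin fashion. That row-wise partitioning is an assumption made for illustration; the slides do not state how the kernel divides work among clusters.

```python
def clustered_matmul(a, b, clusters=8):
    """Matrix multiply that counts operations; rows are split over clusters."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    multiplies = adds = 0
    for cluster in range(clusters):
        # Assumed partitioning: cluster k computes rows k, k + 8, k + 16, ...
        for i in range(cluster, n, clusters):
            for j in range(p):
                acc = 0
                for k in range(m):
                    acc += a[i][k] * b[k][j]
                    multiplies += 1
                    adds += 1
                c[i][j] = acc
    return c, multiplies, adds


N = 16
identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
matrix = [[i * N + j for j in range(N)] for i in range(N)]
product, n_mul, n_add = clustered_matmul(identity, matrix)
print(n_mul, n_add)
```

For N = 16 this performs 4096 multiplies and 4096 accumulating adds and never divides, which is consistent with the observation that the divider sits idle in this kernel.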

Replace Divider with Multiplier

22 cycle loop

Executed on all 8 clusters

97% multiplier utilization in the loop

85% adder utilization in the loop

Changing functional units
– Supported by simulator/compiler
– Architecturally realistic

[Figure: inner-loop instruction schedule for functional units ADD0, ADD1, ADD2, MUL0, MUL1, MUL2]
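A quick consistency check on the two loop schedules: the number of operations issued per loop iteration is the unit count times the cycle count times the utilization. The sketch below applies that to the 32-cycle loop with two multipliers at 100% utilization and the 22-cycle loop with three multipliers at 97%; the helper name is illustrative.

```python
def ops_per_loop(units, cycles, utilization):
    """Operations issued per loop iteration by a group of identical units."""
    return units * cycles * utilization


mults_two_multipliers = ops_per_loop(2, 32, 1.00)    # 64
mults_three_multipliers = ops_per_loop(3, 22, 0.97)  # about 64
adds_three_multipliers = ops_per_loop(3, 22, 0.85)   # about 56
loop_speedup = 32 / 22                               # about 1.45

print(mults_two_multipliers, mults_three_multipliers,
      adds_three_multipliers, loop_speedup)
```

Both configurations issue about 64 multiplies per iteration, so the same work completes in 22 cycles instead of 32 once the divider's slot is given to a third multiplier.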

Kernel Computational Time

Algorithm       Kernel   Functional unit   Execution time   Functional unit   Execution time
                         utilization       (cycles)         utilization       (cycles)
                         (3 +, 2 *)                         (3 +, 3 *)
Estimate        1        70%, 100%         1224             78.6%, 78%        1064
Estimate        2        53%, 91%          22720            85%, 99%          14360
Estimate        3        55%, 42%          1058             55%, 28%          1058
Estimate        Total                                                         14464
Glue Matrices   4        59%, 91%          7468             78%, 84%          5573
Glue Matrices   5        63%, 96%          12192            68%, 71%          11084
Glue Matrices   Total                                                         16657
Detect          6        67%, 100%         366              90%, 90%          275
Detect          7        67%, 96%          996              89%, 84%          760
Detect          Total                                                         1035
Decode          8        13%, 2%           8044             13%, 1.4%         8044
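The per-kernel effect of the third multiplier can be read off the table as a ratio of cycle counts. The sketch below computes that ratio for each kernel from the execution times listed above; the dictionary layout is illustrative.

```python
# kernel number -> (cycles with 3 adders + 2 multipliers,
#                   cycles with 3 adders + 3 multipliers)
CYCLES = {
    1: (1224, 1064),
    2: (22720, 14360),
    3: (1058, 1058),
    4: (7468, 5573),
    5: (12192, 11084),
    6: (366, 275),
    7: (996, 760),
    8: (8044, 8044),
}

speedup = {k: before / after for k, (before, after) in CYCLES.items()}
detect_total = CYCLES[6][1] + CYCLES[7][1]
glue_total = CYCLES[4][1] + CYCLES[5][1]

for kernel, ratio in sorted(speedup.items()):
    print(kernel, round(ratio, 2))
```

Kernel 2 gains the most, at roughly 1.6 times, while kernels 3 and 8 do not change at all. Kernel 8 is the decode kernel, whose listed multiplier utilization is already close to zero, so an extra multiplier has nothing to offer it.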

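Estimation and Detection Execution

[Figure: execution timeline with columns Cycle, Kernel Execution, and Memory Transfers, covering Estimation followed by Detection (10 bits); annotation: stalled waiting for data from memory]

Viterbi Execution

[Figure: execution timeline with columns Cycle, Kernel Execution, and Memory Transfers, covering Initialization followed by Decode (32 bits)]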

Real-time Performance

[Figure: stacked bar chart of clock cycles (0 to 2.5 x 10^4) for slow, medium, and fast fading, broken down into estimation, detection, decoding, and stall time, with a reference line marking real-time at 500 MHz]

Rough DSP Comparison

[Figure: execution time (log scale, 10^-7 to 10^-1) for kernel groups 1; 2,3; 4,5; 6; 7, labelled Estimation, Glue Matrices, and Detection, comparing IMAGINE, TI C67 with internal memory, and TI C67 with external memory]

Future Work

Achieve real-time rates
– Additional functional units (that can be used efficiently!)
– Eliminate communication stalls between kernels
– Support for matrix transposes and bit-level operations

Power and area constraints
– Low power stream processing
– Scaling the architecture for handsets

Scalability with data rates
– Boundaries of the architecture

Handset algorithms

Conclusions

Future wireless communications algorithms
– exceed the capabilities of current DSPs
– require flexibility to change algorithms and parameters
– require efficient use of resources because of delay, area, and power constraints

Architectural developments are needed for future DSPs
– Stream processing is a promising approach
– Additional hardware acceleration, akin to the Viterbi coprocessor on the C64?

The insights gained from our designs can be applied to DSPs and other processors with constraints on delay, area and power.
