DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston, TX This work has been supported by Nokia, TI, TATP and NSF


DSP Architectural Considerations for Optimal Baseband Processing

Sridhar Rajagopal

Scott Rixner

Joseph R. Cavallaro

Behnaam Aazhang

Rice University, Houston, TX

This work has been supported by Nokia, TI, TATP and NSF


Wireless Communication Systems

Flexibility is required

Mobile
– Switch between standards
– Switch between parameters

Base-station
– Varying number of users
– Each user has different parameters

[Figure: wireless mobile device block diagram. Labels: RF Unit, A/D, D/A, Baseband Programmable Communications Processor]


Integration of Cellular/Wireless LAN

W-CDMA base-station
– 4 Mbps
– Delay constraints
– Area constraints?

W-LAN base-station
– 100 Mbps
– Delay constraints
– Area constraints?

Mobile
– W-CDMA & W-LAN
– 1 Mbps & 100 Mbps / # of users
– Delay, area, and power constraints!


Computation Requirements

Estimation, Detection and Decoding in a 4Mbps W-CDMA cellular multiuser system

[Figure: log-scale plot (10^0 to 10^3) of ALUs required for real-time at 500 MHz, with separate Add and Multiply curves. Axis labels: Number of W-CDMA Cellular Users (0 to 300); Data Rates per User. Cases: slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits), fast fading (estimation every 10 bits)]


Proposed DSP System Evolution

Current solutions to meet real-time (racks of DSPs)

[Figure: racks of DSPs evolving into a single programmable DSP processor for 4G wireless systems, measuring < x cm by < x cm]

Future wireless DSP architectures
– x = 2.5 (W-CDMA BS)
– x = 2.0 (W-LAN BS)
– x = 1.5 (Mobile Handset)


The System Design Challenge

Current single processor DSPs not powerful enough for next generation multi-standard applications

Algorithms well understood at data-flow level

Can design real-time systems in fixed VLSI

Pushing towards programmable implementation

Stream processors provide an interesting alternative


Research Contributions

Algorithms for future wireless communications
– Multiuser channel estimation and detection
– Task partitioning, parallelism, pipelining
– Used DSPs to develop and understand algorithms

Special-purpose implementations
– VLSI and FPGA mappings of algorithms
– Conventional and on-line arithmetic

Flexible implementations (current work)
– Future DSP architectures?
– Stream processors?
– Architectural innovation
– Functional unit design and usage


Outline

Motivation

Parallel Algorithms for Estimation, Detection, and Decoding

Stream Processor Architecture

Performance Comparisons and Results


Typical Base-station Algorithms

Wireless LAN
– Equalization?
– FFT
– Viterbi decoding

W-CDMA
– Multiuser channel estimation
– Multiuser detection
– Viterbi decoding

Advanced receiver schemes
– Turbo decoding
– Multiple antenna systems


Parallel W-CDMA Estimation/Detection/Decoding

Multiuser estimation
– Replaced matrix inversion by gradient descent

Multiuser detection
– Parallel Interference Cancellation (PIC)
– Pipelined algorithm that avoids block-based detection

Viterbi decoding
– Trellis structures suited for decoding
– Register exchange for survivor memory
– No traceback latency
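The substitution of gradient descent for matrix inversion can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a small real, symmetric positive-definite matrix standing in for the bit correlation matrix and a vector standing in for one column of the cross-correlation. The function names, the step size `mu`, and the iteration count are all hypothetical.

```python
def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def gradient_descent_solve(r_bb, r_br, mu=0.1, iterations=200):
    """Approximate the solution a of r_bb @ a = r_br without inverting r_bb.

    Each step moves a against the residual (r_bb @ a - r_br), so only
    multiplies and adds are needed, never a division or an inverse.
    """
    a = [0.0] * len(r_br)
    for _ in range(iterations):
        residual = [p - q for p, q in zip(matvec(r_bb, a), r_br)]
        a = [ai - mu * ri for ai, ri in zip(a, residual)]
    return a


# Hypothetical 2x2 system, chosen only for illustration.
r_bb = [[4.0, 1.0], [1.0, 3.0]]
r_br = [1.0, 2.0]
estimate = gradient_descent_solve(r_bb, r_br)
```

The step size must be small enough for the iteration to converge (here `mu` times the largest eigenvalue of `r_bb` is well below 2). In a tracking setting the previous estimate would presumably seed the next update instead of restarting from zero.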


Estimation/Detection (64,32 sizes)

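Multiuser Estimation

$R_{bb} = R_{bb} + b_L b_L^T - b_0 b_0^T$

$R_{br} = R_{br} + b_L r_L^H - b_0 r_0^H$

$A = A - \mu (A R_{bb} - R_{br})$

Prepare Matrices for Detection

$L = \mathrm{Re}[A_0^H A_1]$

$C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]$

$R = L^H$

Multiuser Detection

$y_i = \mathrm{Re}[A_0^H x_i + A_1^H x_{i+1}]$

$d_i = \mathrm{sign}(y_i)$

$y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}$

$d_i = \mathrm{sign}(y_i)$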
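A parallel interference cancellation stage can be sketched as follows. This is a simplified, synchronous illustration that drops the neighbouring-bit terms and keeps only the interference among users within one bit interval. The matrix `c` (cross-correlations scaled by the interfering user's amplitude, with a zero diagonal), the function names, and the two-user numbers are assumptions made for the example.

```python
def sign(v):
    """Hard decision: +1 for non-negative values, -1 otherwise."""
    return 1 if v >= 0 else -1


def pic_stage(y0, c, d):
    """One PIC stage: subtract every other user's estimated interference.

    y0 -- matched-filter outputs, one per user
    c  -- interference matrix with a zero diagonal
    d  -- bit decisions from the previous stage
    """
    y = [y0k - sum(ckj * dj for ckj, dj in zip(row, d))
         for y0k, row in zip(y0, c)]
    return [sign(v) for v in y]


def pic_detect(y0, c, stages=3):
    """Start from matched-filter decisions, then refine them stage by stage."""
    d = [sign(v) for v in y0]
    for _ in range(stages):
        d = pic_stage(y0, c, d)
    return d


# Hypothetical near-far case: a strong user 0 (bit -1) swamps a weak user 1 (bit +1).
y0 = [-1.75, -0.5]
c = [[0.0, 0.25], [1.0, 0.0]]
matched_filter_only = [sign(v) for v in y0]
decisions = pic_detect(y0, c)
```

In this example the matched filter alone decides the weak user's bit incorrectly, and the first cancellation stage recovers it. All users are updated from the previous stage's decisions at once, which is what makes the scheme parallel.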


[Figure: Viterbi trellis for a rate ½ code with K = 5, states X(0) to X(15), in three panels: a. Unsuitable Trellis, b. Suitable Trellis, c. Shuffled Suitable Trellis]


Survivor Management in Viterbi Decoding

Two techniques
– Traceback (commonly used)
– Register exchange

Traceback is simpler
– Less area in VLSI architectures
– Drawback: sequential, with additional latency

Register exchange is faster
– Parallel updates
– Packing decoded bits in the register needs to access the entire register
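Register exchange can be illustrated with a small hard-decision decoder. This is a sketch under stated assumptions, not the decoder from the slides: it uses a constraint length of 3 with generator polynomials 7 and 5 (octal) instead of K = 5 to keep the example short, and all function names are hypothetical.

```python
def branch_output(reg, polys):
    """Parity of the shift register masked by each generator polynomial."""
    return tuple(bin(reg & p).count("1") & 1 for p in polys)


def encode(bits, k=3, polys=(0b111, 0b101)):
    """Rate-1/2 convolutional encoder: one output pair per input bit."""
    state, out = 0, []
    for b in bits:
        reg = (b << (k - 1)) | state
        out.append(branch_output(reg, polys))
        state = reg >> 1
    return out


def viterbi_register_exchange(received, k=3, polys=(0b111, 0b101)):
    """Viterbi decoding in which each state carries its own decoded-bit register."""
    n_states = 1 << (k - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)
    registers = [[] for _ in range(n_states)]
    for pair in received:
        new_metrics = [inf] * n_states
        new_registers = [None] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue
            for b in (0, 1):
                reg = (b << (k - 1)) | state
                expected = branch_output(reg, polys)
                cost = metrics[state] + sum(e != r for e, r in zip(expected, pair))
                nxt = reg >> 1
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    # Register exchange: copy the surviving predecessor's whole
                    # register and append the new bit, so no traceback is needed.
                    new_registers[nxt] = registers[state] + [b]
        metrics, registers = new_metrics, new_registers
    best = min(range(n_states), key=lambda s: metrics[s])
    return registers[best]


# Message with two flushing zeros; one channel bit is flipped before decoding.
message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
coded = encode(message)
noisy = list(coded)
noisy[2] = (1 - coded[2][0], coded[2][1])
decoded = viterbi_register_exchange(noisy)
```

The copy of a full register on every state update is the cost the slide alludes to: every survivor update touches the entire register, but all states can be updated in parallel and the decoded bits are available as soon as the last symbol is processed.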


Outline

Motivation

Parallel Algorithms for Estimation, Detection, and Decoding

Stream Processor Architecture

Performance Comparisons and Results


DSP Evolution and Trends

DSP Architectures
– Increased parallelism and computational throughput
– TI TMS320C6x generation of VLIW DSPs

Media Processing Architectures
– Orders of magnitude increase in parallelism and computational throughput
– 3D graphics!
– Imagine processor developed at MIT/Stanford
  • Prototype fabricated and licensed by TI
  • Flexible and extensible VLIW multiple cluster architecture
– Applicable to wireless communications?


The Imagine Architecture

[Figure: Imagine Stream Processor block diagram. Labels: Host Processor, Stream Controller, Network Interface, Network, Stream Register File, Microcontroller, ALU Cluster 0 to ALU Cluster 7, and a Streaming Memory System with four SDRAM blocks]


Arithmetic Clusters

VLIW control

3 adders, 2 multipliers, 1 divider

Scratch-pad and communication unit

Distributed register files

[Figure: arithmetic cluster. Labels: three adders (+), two multipliers (*), one divider (/), Local Register File, Cross Point, CU, Intercluster Network, From SRF, To SRF]


Bandwidth Hierarchy

41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy from four SDRAM blocks through the Stream Register File to the ALU Clusters, annotated with 2 GB/s, 32 GB/s, and 544 GB/s]
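The three bandwidth figures can be put in proportion with a short calculation. This sketch assumes the figures apply, in order, to the SDRAM, stream register file, and ALU cluster levels, that a word is 4 bytes, and that 1 GB means 10^9 bytes. The variable names are illustrative.

```python
BYTES_PER_WORD = 4  # 32-bit words

# Assumed mapping of the three figures to the three levels of the hierarchy.
bandwidth_gb_per_s = {
    "SDRAM": 2,
    "stream register file": 32,
    "ALU clusters": 544,
}

memory_bw = bandwidth_gb_per_s["SDRAM"]
ratios = {level: bw / memory_bw for level, bw in bandwidth_gb_per_s.items()}

# Operation rate implied by 41.2 operations per word of memory bandwidth.
ops_per_memory_word = 41.2
memory_words_per_s = memory_bw * 1e9 / BYTES_PER_WORD
implied_ops_per_s = ops_per_memory_word * memory_words_per_s
```

Under these assumptions each level offers 16 to 17 times the bandwidth of the one below it, and the quoted ratio corresponds to roughly 20.6 billion 32-bit operations per second when memory bandwidth is fully used.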


Stream Programming

StreamC
– Executes on host processor
– C++
– Controls stream transfers between main memory and SRF

void main() {
  Stream<int> a(256);
  Stream<int> b(256);
  Stream<int> c(256);
  Stream<int> d(1024);
  ...
  example1(a, b, c);
  example2(c, d);
  ...
}

KernelC
– Executes on clusters
– C-like syntax
– Kernel computation
– Compiled by iscd

KERNEL example1(istream<int> a,
                istream<int> b,
                ostream<int> c) {
  loop_stream(a) {
    int ai, bi, ci;
    a >> ai;
    b >> bi;
    ci = ai * 2 + bi * 3;
    c << ci;
  }
}
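For readers without the Imagine toolchain, the behaviour of the `example1` kernel can be mimicked in plain Python. This is only a functional sketch: the batching by 8 assumes that each of the 8 clusters handles one stream element per loop iteration, and the input values are made up.

```python
N_CLUSTERS = 8  # assumed: one stream element per cluster per iteration


def example1(a, b):
    """Consume two input streams element-wise and produce one output stream."""
    c = []
    for base in range(0, len(a), N_CLUSTERS):
        batch_a = a[base:base + N_CLUSTERS]
        batch_b = b[base:base + N_CLUSTERS]
        # Same arithmetic as the kernel body: ci = ai * 2 + bi * 3.
        c.extend(ai * 2 + bi * 3 for ai, bi in zip(batch_a, batch_b))
    return c


# Streams of 256 elements, matching the lengths declared in the StreamC program.
a = list(range(256))
b = [1] * 256
c = example1(a, b)
```

The split mirrors the two languages: the host-side program only moves whole streams and launches kernels, while the per-element arithmetic lives entirely inside the kernel.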


1024-point FFT Performance

Processor   Frequency   Data             Radix   Time     Power    Energy
Imagine     500 MHz     Float (32-bit)   2       7.4 µs   3.8 W    28 µJ
C6711       150 MHz     Float (32-bit)   2       138 µs   ~1.3 W
C6411       300 MHz     Fixed (16-bit)   mixed   21 µs    ~0.5 W
C6201       100 MHz     Fixed (16-bit)   2       227 µs   0.6 W    144 µJ
Virtex II   125 MHz     Fixed (16-bit)   4       2 µs
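The energy column follows from time and power. The sketch below takes the times as microseconds and the energies as microjoules, which is an assumption about the units of the table. The function name is illustrative.

```python
def energy_microjoules(time_us, power_w):
    """Energy = power x time; microseconds times watts gives microjoules."""
    return time_us * power_w


imagine_uj = energy_microjoules(7.4, 3.8)
c6201_uj = energy_microjoules(227, 0.6)

# Relative execution time of the two 32-bit floating-point entries.
speedup_vs_c6711 = 138 / 7.4
```

The Imagine product is about 28.1, in line with the tabulated 28. The C6201 product is about 136 against a tabulated 144, which presumably reflects rounding of the listed power. On these figures Imagine completes the transform roughly 18.6 times faster than the C6711.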


Media vs. Communications

Similarities
– Data parallelism
– Low-precision data
– High computation rates

Different characteristics of communications processing
– More data reorganization, such as matrix transposes
– Bit-level operations

Explore space of stream processor architectures with isim
– Cycle-accurate stream processor simulation
– Flexible machine description language (read by both simulator and compiler)
– Vary number and design of functional units
– Vary memory, register sizes
– etc.


Outline

Motivation

Parallel Algorithms for Estimation, Detection, and Decoding

Stream Processor Architecture

Performance Comparisons and Results


Stream Data Flow

[Figure: stream data flow through Multiuser Channel Estimation, Multiuser Detection, and Decoding, distinguishing Computation from Communication. Kernels: correlation update, matrix mult, iteration update, matrix transpose, matrix mul C, matrix mul L, matched filter, PIC, Viterbi. Other elements: data rearrangement, buffer, estimation bits, detection bits]


Matrix Multiplication Kernel (Imagine)

32 cycle loop

Executed on all 8 clusters

Complexity
– O(N^3) multiplies
– O(N^3) adds

100% multiplier utilization in the loop

Divider is unnecessary!

[Figure: inner-loop instruction schedule across ADD0, ADD1, ADD2, MUL0, MUL1, DIV0. Legend: Communication (waiting for input); FU unavailable (input ready but FU busy)]
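The O(N^3) operation count and the spreading of work over 8 clusters can be made concrete with a small sketch. The row-wise assignment of output rows to clusters is an assumption made for illustration, since the slides do not state how the kernel partitions its data, and the function name is hypothetical.

```python
N_CLUSTERS = 8


def matmul_by_cluster(a, b):
    """Multiply two matrices, assigning output rows to clusters round-robin.

    Returns the product and the number of scalar multiplies performed.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    multiplies = 0
    for cluster in range(N_CLUSTERS):
        for i in range(cluster, n, N_CLUSTERS):  # rows owned by this cluster
            for j in range(p):
                acc = 0
                for k in range(m):
                    acc += a[i][k] * b[k][j]  # one multiply and one add
                    multiplies += 1
                c[i][j] = acc
    return c, multiplies


# 8x8 identity times an arbitrary 8x8 matrix: N^3 = 512 multiplies.
identity = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
other = [[i * 8 + j for j in range(8)] for i in range(8)]
product, count = matmul_by_cluster(identity, other)
```

Every inner step is a multiply followed by an add and no division appears anywhere, which is consistent with the slide's remark that the divider sits idle in this kernel.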


Replace Divider with Multiplier

22 cycle loop

Executed on all 8 clusters

97% multiplier utilization in the loop

85% adder utilization in the loop

Changing functional units
– Supported by simulator/compiler
– Architecturally realistic

[Figure: inner-loop instruction schedule across ADD0, ADD1, ADD2, MUL0, MUL1, MUL2]
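The two loop lengths are consistent with the same amount of multiply work per cluster. The check below reads utilization as the fraction of multiplier issue slots used in the loop, which is an assumed interpretation of the percentages, and compares the 32-cycle loop on 2 multipliers at 100% with the 22-cycle loop on 3 multipliers at 97%.

```python
def multiplies_per_loop(n_multipliers, loop_cycles, utilization):
    """Multiplier issue slots actually used in one pass of the loop."""
    return n_multipliers * loop_cycles * utilization


baseline = multiplies_per_loop(2, 32, 1.00)  # 3 adders, 2 multipliers, 1 divider
modified = multiplies_per_loop(3, 22, 0.97)  # divider replaced by a multiplier

# Reduction in loop length from swapping the divider for a third multiplier.
loop_speedup = 32 / 22
```

Both configurations come out at about 64 multiplies per loop per cluster, so the roughly 1.45 times shorter loop is explained by the extra multiplier rather than by any reduction in work.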


Kernel Computational Time

Functional unit utilization is listed as adders, multipliers.

Algorithm       Kernel   Utilization (3 +, 2 *)   Time (cycles)   Utilization (3 +, 3 *)   Time (cycles)
Estimate        1        70%, 100%                1224            78.6%, 78%               1064
                2        53%, 91%                 22720           85%, 99%                 14360
                3        55%, 42%                 1058            55%, 28%                 1058
                Total                                                                      14464
Glue Matrices   4        59%, 91%                 7468            78%, 84%                 5573
                5        63%, 96%                 12192           68%, 71%                 11084
                Total                                                                      16657
Detect          6        67%, 100%                366             90%, 90%                 275
                7        67%, 96%                 996             89%, 84%                 760
                Total                                                                      1035
Decode          8        13%, 2%                  8044            13%, 1.4%                8044


Estimation and Detection Execution

[Figure: execution timeline with columns Cycle, Kernel Execution, and Memory Transfers, covering Estimation and Detection (10 bits), with a region marked "Stalled waiting for data from memory"]


Viterbi Execution

[Figure: execution timeline with columns Cycle, Kernel Execution, and Memory Transfers, covering Initialization and Decode (32 bits)]


Real-time Performance

[Figure: bar chart of clock cycles (axis up to 2.5 x 10^4) for Slow Fading, Medium Fading, and Fast Fading, broken down into estimation, detection, decoding, and stall time, with the real-time level at 500 MHz marked]


Rough DSP Comparison

[Figure: log-scale plot of execution time (10^-7 to 10^-1) for kernels 1; 2,3; 4,5; 6; 7, grouped under Estimation, Glue Matrices, and Detection. Curves: IMAGINE; TI C67: Internal Memory; TI C67: External Memory]


Future Work

Achieve real-time rates
– Additional functional units (that can be used efficiently!)
– Eliminate communication stalls between kernels
– Support for matrix transposes and bit-level operations

Power and area constraints
– Low power stream processing
– Scaling the architecture for handsets

Scalability with data rates
– Boundaries of the architecture

Handset algorithms


Conclusions

Future wireless communications algorithms
– exceed the capabilities of current DSPs
– require flexibility to change algorithms and parameters
– require efficient use of resources because of delay, area, and power constraints

Architectural developments are needed for future DSPs
– Stream processing is a promising approach
– Additional hardware acceleration, akin to Viterbi coprocessor on C64?

The insights gained from our designs can be applied to DSPs and other processors with constraints on delay, area and power.