
RICE UNIVERSITY

Flexible wireless communication architectures

Sridhar Rajagopal

Department of Electrical and Computer Engineering, Rice University, Houston TX

Faculty Candidate Seminar – Southern Methodist University, April 23, 2003

This work has been supported in part by NSF, Nokia and Texas Instruments


Future wireless devices demand flexibility

➢ Multiple algorithms and environments supported in same device

➢ High data rate mobile devices with multimedia
   ⇒ Flexible algorithms: Multiple antennas, complex signal processing
   ⇒ Flexible architectures: High performance (Mbps), low power (mW)

➢ Fast design with structured exploration

Bluetooth/Home Networks

Wireless Cellular

Wireless LAN


Flexibility needed in different layers

Physical Layer

MAC Layer

Network Layer

Application Layer: Puppeteer project at Rice, http://www.cs.rice.edu/CS/Systems/Puppeteer/

Analog RF

Flexible Algorithms

Mapping

Flexible Architectures


Research vision: Attain flexibility

➢ Algorithms:
   ⇒ Flexibility: support variety of sophisticated algorithms

➢ Architectures:
   ⇒ Flexibility: adapts hardware to algorithms

➢ Fast, structured design exploration

Design me


Contributions: Algorithms

Multi-user channel estimation: [Jnl. of VLSI Sig. Proc.’02, ASAP’00]
➢ Matrix inversions
➢ Numerical techniques ⇒ conjugate-gradient descent for complexity reduction

Multi-user detection: [ISCAS’01]
➢ Block-based computation to streaming computations
   ⇒ Pipelining, lower memory requirements

Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]


Contributions: Architectures

Heterogeneous DSP-FPGA system designs: [ICSPAT’00]

Computer arithmetic: [Symp. on Comp. Arith.’01]
Dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation

[Ph.D. Thesis]

Scalable Wireless Application-specific Processors (SWAPs)

Rapid, structured architectures with flexibility-performance tradeoffs


Scalable Wireless Application-specific Processors

➢ Family of flexible programmable processors
   ⇒ Clusters of ALUs
   ⇒ High performance by supporting 100’s of ALUs
   ⇒ Can provide customization for various algorithms
   ⇒ Adapts (“swaps”) architecture dynamically for power

[Figure: array of ALU clusters, each holding adders (+) and multipliers (*), with scaling directions labeled "Scale Clusters" and "Scale ALUs"]


Rapid, structured design for SWAPs

Low “complexity”, parallel, fixed-point algorithms

[Figure: architecture exploration flow; labels: Architecture Exploration, ASIC design (apply), DSP design (apply), SWAPs (clusters of ALUs)]


Research vision summary

➢ Provide a structured framework to rapidly explore:
   ⇒ flexible, high performance, low power architectures (SWAPs)

➢ Efficient algorithm design for mapping to SWAPs

➢ Understanding of algorithms, DSPs and ASICs used

➢ Flexibility-performance trade-offs

Inter-disciplinary research: Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, Circuits, CAD, Compilers


Talk Outline

➢ Research vision

➢ SWAPs - Background

➢ Algorithm design for SWAPs

➢ Architecture design for SWAPs

➢ Current and Future Research Goals


SWAPs borrow from DSPs

➢ DSPs use: Instruction Level Parallelism (ILP), Subword Parallelism (MMX)

➢ Not enough ALUs for GOPs of computation: need 100’s
   ⇒ TI C6x has 8 ALUs

➢ Why not more ALUs?
   ⇒ Cannot support more registers (area, ports)
   ⇒ Difficult to find ILP as ALUs increase

[Figure: register file (RF) shown for 1, 4, 16 and 32 ALUs]


SWAPs borrow from ASICs

Exploit data parallelism (DP)
   ⇒ Available in many wireless algorithms
   ⇒ This is what ASICs do!

    #define N 1024

    int i, a[N], b[N], sum[N];        // 32 bits
    short int c[N], d[N], diff[N];    // 16 bits, packed

    for (i = 0; i < N; ++i) {
        sum[i]  = a[i] + b[i];        // independent of the next statement
        diff[i] = c[i] - d[i];
    }

[Figure annotations on the loop: ILP, DP, Subword]
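As a rough sketch of how the DP dimension of the loop above could be spread over identical clusters, the code below gives each cluster a strided slice of the iterations. The names `cluster_kernel`, `run_all_clusters` and `NUM_CLUSTERS`, and the strided assignment itself, are assumptions made for illustration and are not taken from the slides; on real hardware the clusters would run concurrently, whereas here they are simply looped over.

```c
#include <assert.h>
#include <stdint.h>

#define N 1024
#define NUM_CLUSTERS 8          /* hypothetical cluster count */

/* Work done by one cluster: the same operations on different data (DP).
 * Inside the loop body the add and the subtract are independent (ILP),
 * and the 16-bit operands are candidates for subword packing. */
static void cluster_kernel(int cluster,
                           const int32_t *a, const int32_t *b, int32_t *sum,
                           const int16_t *c, const int16_t *d, int16_t *diff)
{
    assert(cluster >= 0 && cluster < NUM_CLUSTERS);
    for (int i = cluster; i < N; i += NUM_CLUSTERS) {
        sum[i]  = a[i] + b[i];
        diff[i] = (int16_t)(c[i] - d[i]);
    }
}

/* Sequential stand-in for all clusters executing the kernel together. */
static void run_all_clusters(const int32_t *a, const int32_t *b, int32_t *sum,
                             const int16_t *c, const int16_t *d, int16_t *diff)
{
    for (int k = 0; k < NUM_CLUSTERS; ++k)
        cluster_kernel(k, a, b, sum, c, d, diff);
}
```

Calling `run_all_clusters` produces the same `sum` and `diff` arrays as the original single loop, because the strided slices cover every index exactly once.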


SWAPs borrow from stream processors

[Figure: stream program. Kernels: correlator, channel estimation, matched filter, interference cancellation, Viterbi decoding. Streams: received signal (input data) through to decoded bits (output data)]

➢ Kernels (computation) and streams (communication)

➢ Use local data in clusters providing GOPs support

➢ Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.


SWAPs are multi-cluster DSPs

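[Figure: a DSP drawn as 1 cluster of adders (+) and multipliers (*) with internal memory, exploiting ILP; SWAPs drawn as many such clusters fed from a Stream Register File (SRF), exploiting ILP within a cluster and DP across clusters]

SWAPs adapt clusters to DP

Identical clusters, same operations. Power-down unused FUs, clusters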


Arithmetic clusters in SWAPs

[Figure: one arithmetic cluster. Labels: Intercluster Network, Comm. Unit, Scratchpad (indexed accesses), SRF, From/To SRF, Cross Point, Distributed Register Files (supports more ALUs), functional units (+, *, /)]


SWAPs: Physical layer algorithms

[Figure: receiver chain: Antenna → RF Front-end → Baseband processing (Channel estimation → Detection → Decoding) → Higher (MAC/Network/OS) Layers]

Complex signal processing algorithms with GOPs of computation


SWAP mapping example: Viterbi decoding

➢ Multiple antenna systems (MIMO systems)
   ⇒ Complexity exponential with transmit x receive antennas

➢ Estimation: Linear MMSE, blind, conjugate gradient…

➢ Detection: FFT, (blind) interference cancellation…

➢ Decoding: Viterbi, Turbo, LDPC… & joint schemes

➢ SWAP flexibility lets you use the best algorithms for the situation

Example for concept demonstration: Viterbi decoding


Parallel Viterbi Decoding for SWAPs

➢ Add-Compare-Select (ACS): trellis interconnect: computations
   ⇒ Parallelism depends on constraint length (#states)

➢ Traceback: searching
   ⇒ Conventional
      • Sequential (No DP) with dynamic branching
      • Difficult to implement in parallel architecture
   ⇒ Use Register Exchange (RE)
      • parallel solution

[Figure: Detected bits → ACS Unit → Traceback Unit → Decoded bits]
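To make the ACS and Register Exchange steps concrete, here is a minimal hard-decision sketch. Everything specific in it is an assumption chosen for brevity rather than taken from the talk: a constraint length of K = 3, rate 1/2, generator polynomials 7 and 5 (octal), Hamming branch metrics, and at most 64 decoded bits so that each survivor fits in one 64-bit register. The point it illustrates is that each next state performs its add-compare-select and copies its survivor register independently of the other states, with no sequential traceback.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STATES 4        /* K = 3, so 2^(K-1) = 4 trellis states */
#define BIG_METRIC 1000000  /* marks states that are unreachable at start */

/* Parity (XOR) of the low three bits of x. */
static int parity3(unsigned x) { return (int)((x ^ (x >> 1) ^ (x >> 2)) & 1u); }

/* Rate-1/2 convolutional encoder, generators 7 and 5 (octal), zero start state. */
static void conv_encode(const int *bits, int n, int *coded)
{
    unsigned state = 0;
    for (int t = 0; t < n; ++t) {
        unsigned reg = ((unsigned)bits[t] << 2) | state;
        coded[2 * t]     = parity3(reg & 7u);
        coded[2 * t + 1] = parity3(reg & 5u);
        state = reg >> 1;
    }
}

/* Viterbi decoder that keeps survivors by register exchange. */
static void viterbi_re_decode(const int *coded, int n, int *decoded)
{
    assert(n > 0 && n <= 64);
    int pm[NUM_STATES] = {0, BIG_METRIC, BIG_METRIC, BIG_METRIC};
    uint64_t surv[NUM_STATES] = {0, 0, 0, 0};

    for (int t = 0; t < n; ++t) {
        int new_pm[NUM_STATES];
        uint64_t new_surv[NUM_STATES];
        for (unsigned ns = 0; ns < NUM_STATES; ++ns) {  /* independent per state */
            unsigned bit = ns >> 1;                     /* input bit leading to ns */
            int best = 0;
            unsigned best_ps = 0;
            for (unsigned k = 0; k < 2; ++k) {
                unsigned ps  = ((ns & 1u) << 1) | k;    /* the two predecessors */
                unsigned reg = (bit << 2) | ps;
                int bm = (parity3(reg & 7u) != coded[2 * t])
                       + (parity3(reg & 5u) != coded[2 * t + 1]);
                int cand = pm[ps] + bm;                                      /* add */
                if (k == 0 || cand < best) { best = cand; best_ps = ps; }    /* compare, select */
            }
            new_pm[ns]   = best;
            new_surv[ns] = (surv[best_ps] << 1) | bit;  /* register exchange */
        }
        for (unsigned s = 0; s < NUM_STATES; ++s) {
            pm[s] = new_pm[s];
            surv[s] = new_surv[s];
        }
    }

    unsigned best_state = 0;
    for (unsigned s = 1; s < NUM_STATES; ++s)
        if (pm[s] < pm[best_state]) best_state = s;
    for (int t = 0; t < n; ++t)   /* first decoded bit sits in the highest used position */
        decoded[t] = (int)((surv[best_state] >> (n - 1 - t)) & 1u);
}
```

A round trip through `conv_encode` and `viterbi_re_decode` recovers the message. The cost of register exchange is that every state carries a full survivor register, which is the memory-for-parallelism trade the slide alludes to.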


Parallel Viterbi needs re-ordering for SWAPs

Exploiting Viterbi DP in SWAPs:
   ⇒ Use RE instead of regular traceback
   ⇒ Re-order ACS, RE

[Figure: ACS trellis for 16 states X(0) to X(15). Regular ACS: inputs and outputs both in natural order X(0), X(1), …, X(15). ACS in SWAPs: inputs re-ordered into a DP vector, X(0), X(2), …, X(14) followed by X(1), X(3), …, X(15), with outputs in natural order X(0), …, X(15)]
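The re-ordering in the figure can be sketched as an even/odd split of the path-metric vector. The sketch assumes the usual shift-register trellis in which new states j and j + n/2 share the predecessors 2j and 2j + 1; that convention, the function names, and the layout of the hypothetical branch-metric table `bm` are illustrative assumptions, not details given on the slide. After the split, both predecessors of a state pair sit at the same index j of their half, so each j can be handled by one cluster without irregular accesses.

```c
#include <assert.h>

/* Re-order metrics as [x0, x2, ..., x(n-2), x1, x3, ..., x(n-1)]. */
static void reorder_even_odd(const int *x, int n, int *y)
{
    assert(n > 0 && n % 2 == 0);
    int half = n / 2;
    for (int j = 0; j < half; ++j) {
        y[j]        = x[2 * j];
        y[half + j] = x[2 * j + 1];
    }
}

/* ACS on the re-ordered vector. bm[ns][0] and bm[ns][1] are the branch
 * metrics into new state ns from its even and odd predecessor. */
static void acs_reordered(const int *y, int n, int bm[][2], int *new_pm)
{
    assert(n > 0 && n % 2 == 0);
    int half = n / 2;
    for (int j = 0; j < half; ++j) {        /* every j is independent (DP) */
        int even = y[j];
        int odd  = y[half + j];
        for (int hi = 0; hi < 2; ++hi) {
            int ns = j + hi * half;
            int c0 = even + bm[ns][0];      /* add */
            int c1 = odd  + bm[ns][1];
            new_pm[ns] = c0 < c1 ? c0 : c1; /* compare, select */
        }
    }
}
```

For 16 states, `reorder_even_odd` reproduces the ordering shown for "ACS in SWAPs", and `acs_reordered` writes the new metrics back in natural order.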


SWAP architecture design

More clusters are better than more ALUs per cluster (if #clusters > 2)

1. Decide how many clusters
   ⇒ Exploit DP

2. Decide what to put within each cluster
   ⇒ Maximize ILP with high functional unit efficiency
   ⇒ Search design space with “explore” tool

Time-power-area characterization

[Figure: clusters of adders (+) and multipliers (*); ILP within a cluster, DP across clusters]


Design a SWAP cluster: “Explore”

Auto-exploration of adders and multipliers for “ACS”

[Figure: 3-D plot of instruction count (axis 40 to 160) versus #Adders (1 to 5) and #Multipliers (1 to 5); each design point is labeled (Adder util%, Multiplier util%), with labels ranging from (43,58) to (85,11)]


“Explore” tool benefits

➢ Instruction count vs. ALU efficiency
   ⇒ What goes inside each cluster

➢ Design customized application-specific units
   ⇒ Better performance with increased ALU utilization

➢ Explore multiple algorithms
   ⇒ Turn off functional units not in use for given kernel
   ⇒ Vdd-gating, clock gating techniques
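The trade-off between instruction count and ALU utilization can be illustrated with a toy model. This is not the “explore” tool: it ignores data dependencies, operation latencies and register pressure, and simply takes a kernel's add and multiply counts (hypothetical inputs) and bounds the schedule length by the busiest unit type. The struct and function names are made up for the sketch.

```c
#include <assert.h>

typedef struct {
    int cycles;          /* lower bound on schedule length */
    int adder_util_pct;  /* percent of adder issue slots used */
    int mult_util_pct;   /* percent of multiplier issue slots used */
} ClusterEstimate;

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Resource-bound estimate for one cluster configuration. */
static ClusterEstimate estimate_cluster(int add_ops, int mul_ops,
                                        int adders, int multipliers)
{
    assert(adders > 0 && multipliers > 0);
    assert(add_ops >= 0 && mul_ops >= 0 && add_ops + mul_ops > 0);

    ClusterEstimate e;
    int add_cycles = ceil_div(add_ops, adders);
    int mul_cycles = ceil_div(mul_ops, multipliers);
    e.cycles = add_cycles > mul_cycles ? add_cycles : mul_cycles;
    e.adder_util_pct = 100 * add_ops / (adders * e.cycles);
    e.mult_util_pct  = 100 * mul_ops / (multipliers * e.cycles);
    return e;
}
```

For a hypothetical kernel with 24 adds and 8 multiplies, going from 3 adders and 1 multiplier to 5 of each shortens the bound from 8 to 5 cycles while multiplier utilization falls from 100% to 32%, which is the kind of tension the utilization labels in the exploration plot expose.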


Example for SWAP architecture design

Explore Algorithm 1 : 3 adders, 3 multipliers, 32 clusters

Explore Algorithm 2 : 4 adders, 1 multiplier, 64 clusters

Explore Algorithm 3 : 2 adders, 2 multipliers, 64 clusters

Explore Algorithm 4 : 2 adders, 2 multipliers, 16 clusters

Chosen Architecture: 4 adders, 3 multipliers, 64 clusters

[Figure annotations: ILP (adders and multipliers per cluster), DP (clusters)]
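The chosen architecture on this slide equals the per-resource maximum over the four explored algorithms, which can be checked with a few lines. The struct and function names are illustrative assumptions; the numbers are the ones on the slide.

```c
#include <assert.h>

typedef struct {
    int adders;       /* per cluster */
    int multipliers;  /* per cluster */
    int clusters;
} SwapConfig;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers every algorithm's requirement. */
static SwapConfig cover_all(const SwapConfig *req, int n)
{
    assert(n > 0);
    SwapConfig out = req[0];
    for (int i = 1; i < n; ++i) {
        out.adders      = max_int(out.adders, req[i].adders);
        out.multipliers = max_int(out.multipliers, req[i].multipliers);
        out.clusters    = max_int(out.clusters, req[i].clusters);
    }
    return out;
}
```

Feeding in (3, 3, 32), (4, 1, 64), (2, 2, 64) and (2, 2, 16) yields 4 adders, 3 multipliers and 64 clusters. Each individual algorithm then leaves some of these resources idle, which is what makes turning off ALUs and clusters worthwhile.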


SWAP flexibility provides power savings

➢ Multiple algorithms
   ⇒ Different ALU, cluster requirements

➢ Turning off ALUs (–add –mul compiler options)
   ⇒ Use the right #ALUs from “explore” tool

➢ Turning off clusters
   ⇒ Data across SRF of all clusters
   ⇒ Cluster only has access to its own SRF
   ⇒ Next kernel may need data from SRF of other clusters
   ⇒ Reconfiguration support needs to be provided


SWAPs provide cluster reconfiguration

[Figure: SRF connected to the clusters through a mux-demux network with stream buffers; labels: MDX1, MDX2, LATCH]

Additional latency (few cycles) due to microcontroller stalls

- Minimal loss in performance


Cluster reconfiguration for Viterbi

Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)

[Figure annotations: DP; unused clusters "Can be turned OFF"]
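The cluster counts above are consistent with a fixed number of trellis states per cluster: a constraint length K gives 2^(K-1) states, and dividing by 4 reproduces 16, 64 and 4 clusters for K = 7, 9 and 5. The ratio of 4 is inferred from these numbers rather than stated on the slide, and the function names are hypothetical.

```c
#include <assert.h>

#define MAX_CLUSTERS 64
#define STATES_PER_CLUSTER 4   /* assumption inferred from the slide's numbers */

/* Number of trellis states for constraint length K. */
static int viterbi_states(int K)
{
    assert(K >= 1 && K <= 16);
    return 1 << (K - 1);
}

/* Clusters kept powered for a packet with constraint length K. */
static int active_clusters(int K)
{
    int wanted = viterbi_states(K) / STATES_PER_CLUSTER;
    if (wanted < 1) wanted = 1;
    return wanted > MAX_CLUSTERS ? MAX_CLUSTERS : wanted;
}

/* Clusters that can be turned off for that packet. */
static int idle_clusters(int K)
{
    return MAX_CLUSTERS - active_clusters(K);
}
```

Under this reading, a K = 5 packet leaves 60 of the 64 clusters idle, while a K = 9 packet uses all of them.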


[Figure: execution time (cycles) of clusters and memory for 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5); kernels (computation) with no data memory accesses]

SWAPs provide flexibility at negligible overhead


SWAP exploration for Viterbi decoding

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) versus number of clusters (1 to 100) for K = 9, K = 7 and K = 5; curves for different SWAPs (without reconfiguration) and the same SWAP (with reconfiguration); DSP and Max DP points marked]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time


SWAPs : Salient features

➢ 1-2 orders of magnitude better than a DSP

➢ Any constraint length ⇒ 10 MHz at 128 Kbps

➢ Same code for all constraint lengths
   ⇒ no need to re-compile or load another code
   ⇒ as long as parallelism/cluster ratio is constant

➢ Power savings due to dynamic cluster scaling
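The 10 MHz and 128 Kbps figures translate into a per-bit cycle budget, and the ~200 MHz quoted earlier for an ideal C64x gives the clock ratio against the DSP. The sketch below only does that arithmetic; it assumes 1 Kbps = 1000 bits/s, and the function names are made up.

```c
#include <assert.h>

/* Clock cycles available per decoded bit at a given clock and bit rate. */
static double cycles_per_bit(double clock_hz, double bit_rate_bps)
{
    assert(clock_hz > 0.0 && bit_rate_bps > 0.0);
    return clock_hz / bit_rate_bps;
}

/* How many times faster the DSP must be clocked for the same real-time target. */
static double clock_ratio(double dsp_hz, double swap_hz)
{
    assert(dsp_hz > 0.0 && swap_hz > 0.0);
    return dsp_hz / swap_hz;
}
```

`cycles_per_bit(10e6, 128e3)` is about 78 cycles per bit, and `clock_ratio(200e6, 10e6)` is 20, which sits inside the "1-2 orders of magnitude" range claimed above.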


Expected SWAP power consumption

➢ Power model based on [Khailany’03]

➢ 64 clusters and 1 multiplier per cluster:
   ⇒ 0.13 micron, 1.2 V
   ⇒ Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW)
   ⇒ Area: ~53.7 mm²

➢ 10 MHz, 128 Kbps with reconfiguration

Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.

[Figure: power (in mW, 0 to 90) versus active clusters (0 to 70, max 64)]

Viterbi       Clusters Used   Peak Power
K = 9         64              ~90 mW
K = 7         16              ~28.57 mW
K = 5         4               ~13.8 mW
overhead      0               ~8.1 mW
DSP, K = 9    1               ~200 mW


Multiuser Estimation-Detection+Decoding

Real-time target : 128 Kbps per user

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 10 to 100000) versus number of clusters (1 to 100) under FAST, MEDIUM and SLOW fading scenarios; 32-user base-station, Mobile and DSP points marked]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time


Expected SWAP power : base-station

➢ 32 user base-station with 3 X’s per cluster and 64 clusters:
   ⇒ 0.13 micron, 1.2 V
   ⇒ Peak Active Power: ~18.19 mW for 1 MHz (increased X)
   ⇒ Area: ~93.4 mm²

➢ Total Peak Base-station power consumption:
   ⇒ ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
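The step from ~18.19 mW at 1 MHz to ~18.19 W at 1 GHz is a linear scaling of peak active power with clock frequency at fixed voltage. The helper below, whose name is made up, only reproduces that arithmetic; the same scaling takes the ~9 mW at 1 MHz quoted for the 64-cluster Viterbi configuration to the ~90 mW listed for K = 9 at 10 MHz.

```c
#include <assert.h>

/* Peak active power in mW, assuming it scales linearly with clock frequency. */
static double peak_power_mw(double mw_per_mhz, double clock_mhz)
{
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}
```

`peak_power_mw(18.19, 1000.0)` gives about 18190 mW, that is ~18.19 W, and `peak_power_mw(9.0, 10.0)` gives 90 mW.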


Current research: Flexibility vs. performance

SWAPs: 128 Kbps at ~10-100 mW for Viterbi
   ⇒ Borrow DP from ASICs!

➢ suitable for base-stations
   ⇒ Flexibility more important than power

➢ suitable for mobile devices
   ⇒ Power constraints tighter
   ⇒ can be customized for further power savings

Handset SWAPs (H-SWAPs)
   ⇒ Borrow Task pipelining from ASICs!
   ⇒ Application-specific units and specialized comm. network


Handset SWAPs: H-SWAPs

➢ Trade Data Parallelism for Task Pipelining

[Figure: three designs fed from an SRF. SWAPs (max. clusters and reconfigure): many +++*** clusters, full DP. SWAPlet (limit clusters): a few +++* clusters, limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets with different ALU mixes (+++*, ++*, ++++), each with limited DP]


Sample points in architecture exploration

DSPs (1 cluster): ILP, Subword

SWAPs (multiple clusters): ILP, Subword, DP

H-SWAPs (optimized for handsets): ILP, Subword, DP, Task Pipelining, Custom ALUs

Programmable solutions with increased customization

Performance, Power benefits (with decreasing flexibility)


Future: Efficient algorithms and mapping

[Figure: receiver block diagram. Labels: Multipath Channel, Channel, Channel Estimator, Equalizer, MRC, Detector, Demodulator, Decoder, Turbo Equalizer, Beam-forming, Coherent STC, Non-Coherent STC]

Multiple antenna systems with 1-2 orders-of-magnitude higher complexity


Future research: Architectures

Generalized and structured framework and tools
   ⇒ Joint algorithm-architecture exploration
   ⇒ Area-time-power-flexibility tradeoffs

Potential applications: embedded systems

➢ Image and Video processing:
   ⇒ Cameras: variety of compression algorithms

➢ Biomedical applications:
   ⇒ Hearing aids: DSP running on body heat*

➢ Sensor networks
   ⇒ Compression of data before transmission

*Quote: Gene Frantz, TI Fellow


SWAPs: Flexibility, Performance, Power

➢ Need flexibility in future wireless devices
   ⇒ Algorithms and Architectures

➢ Rapid Exploration for Scalable, Wireless Application-specific Processors
   ⇒ Structured approach with flexibility-performance trade-offs

➢ SWAPs - flexibility, high performance and low power
   ⇒ Exploit data parallelism like ASICs
   ⇒ 1-2 orders better performance than DSPs
   ⇒ Turn off unused clusters and unused ALUs for low power
