RICE UNIVERSITY
Flexible wireless communication architectures
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston TX
Faculty Candidate Seminar – Southern Methodist University, April 23, 2003
This work has been supported in part by NSF, Nokia and Texas Instruments

Future wireless devices demand flexibility
➢ Multiple algorithms and environments supported in same device
➢ High data rate mobile devices with multimedia
  ⇒ Flexible algorithms: Multiple antennas, complex signal processing
  ⇒ Flexible architectures: High performance (Mbps), low power (mW)
➢ Fast design with structured exploration
Environments shown: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN

Flexibility needed in different layers
Physical Layer
MAC Layer
Network Layer
Application Layer (Puppeteer project at Rice: http://www.cs.rice.edu/CS/Systems/Puppeteer/)
Analog RF
Flexible Algorithms
Mapping
Flexible Architectures

Research vision: Attain flexibility
➢ Algorithms:
  ⇒ Flexibility: support a variety of sophisticated algorithms
➢ Architectures:
  ⇒ Flexibility: adapt hardware to algorithms
➢ Fast, structured design exploration
Design me

Contributions: Algorithms
Multi-user channel estimation: [Jnl. of VLSI Sig. Proc.’02, ASAP’00]
➢ Matrix inversions
➢ Numerical techniques
  ⇒ Conjugate-gradient descent for complexity reduction
Multi-user detection: [ISCAS’01]
➢ Block-based computation to streaming computations
  ⇒ Pipelining, lower memory requirements
Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]

Contributions: Architectures
Heterogeneous DSP-FPGA system designs: [ICSPAT’00]
Computer arithmetic: [Symp. on Comp. Arith.’01]
Dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation
[Ph.D. Thesis]
Scalable Wireless Application-specific Processors (SWAPs)
Rapid, structured architectures with flexibility-performance tradeoffs

Scalable Wireless Application-specific Processors
➢ Family of flexible programmable processors
  ⇒ Clusters of ALUs
  ⇒ High performance by supporting 100’s of ALUs
  ⇒ Can provide customization for various algorithms
  ⇒ Adapts (“swaps”) architecture dynamically for power
[Figure: clusters of adders (+) and multipliers (*); the design scales both the number of clusters and the number of ALUs per cluster]

Rapid, structured design for SWAPs
[Figure: low-“complexity”, parallel, fixed-point algorithms feed architecture exploration; ASIC design and DSP design are applied to SWAPs (clusters of + and * units)]

Research vision summary
➢ Provide a structured framework to rapidly explore:
  ⇒ flexible, high performance, low power architectures (SWAPs)
➢ Efficient algorithm design for mapping to SWAPs
➢ Understanding of algorithms, DSPs and ASICs used
➢ Flexibility-performance trade-offs
Inter-disciplinary research: Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, Circuits, CAD, Compilers

Talk Outline
➢ Research vision
➢ SWAPs - Background
➢ Algorithm design for SWAPs
➢ Architecture design for SWAPs
➢ Current and Future Research Goals

SWAPs borrow from DSPs
➢ DSPs use: Instruction Level Parallelism (ILP), Subword Parallelism (MMX)
➢ Not enough ALUs for GOPs of computation; need 100’s
  ⇒ TI C6x has 8 ALUs
➢ Why not more ALUs?
  ⇒ Cannot support more registers (area, ports)
  ⇒ Difficult to find ILP as ALUs increase
[Figure: register file (RF) growth as the number of ALUs goes from 1 to 4, 16 and 32]

SWAPs borrow from ASICs
Exploit data parallelism (DP)
  ⇒ Available in many wireless algorithms
  ⇒ This is what ASICs do!
#define N 1024
int i, a[N], b[N], sum[N];        // 32 bits
short int c[N], d[N], diff[N];    // 16 bits, packed
for (i = 0; i < N; ++i) {
    sum[i]  = a[i] + b[i];
    diff[i] = c[i] - d[i];
}
Parallelism annotated on the loop: ILP, DP, Subword
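The "16 bits packed" comment on the loop refers to subword parallelism: two 16-bit operands share one 32-bit word and are processed by a single ALU operation. The following is a minimal sketch of that idea in portable C, not code from the talk; the helper names (pack16, lane16, packed_sub16) are hypothetical, and a real DSP would use a dedicated packed-subtract instruction instead of the masking trick shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Top bit of each 16-bit lane inside a 32-bit word. */
#define LANE_MSB 0x80008000u

/* Pack two 16-bit lanes into one word (hi lane in the upper half). */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Extract lane 0 (low half) or lane 1 (high half). */
uint16_t lane16(uint32_t w, int lane)
{
    assert(lane == 0 || lane == 1);
    return (uint16_t)(w >> (16 * lane));
}

/* Subtract both 16-bit lanes of y from those of x with one 32-bit
 * subtraction. Each lane wraps modulo 2^16 on its own. */
uint32_t packed_sub16(uint32_t x, uint32_t y)
{
    /* Set the top bit of every lane of x and clear it in y, so the
     * subtraction of the low 15 bits can never borrow across lanes. */
    uint32_t d = (x | LANE_MSB) - (y & ~LANE_MSB);
    /* Restore the true top bit of every lane: x_msb ^ y_msb ^ borrow. */
    return d ^ ((x ^ ~y) & LANE_MSB);
}
```

For example, subtracting the packed pair (7, 1) from (5, 3) gives (65534, 2): the upper lane wraps around without disturbing the lower one.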

SWAPs borrow from stream processors
[Figure: kernels (correlator channel estimation, matched filter, interference cancellation, Viterbi decoding) linked by streams, from the received signal (input data) to the decoded bits (output data)]
➢ Kernels (computation) and streams (communication)
➢ Use local data in clusters providing GOPs support
➢ Imagine stream processor at Stanford [Rixner’01]
Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.
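The kernel/stream split can be pictured as a chain of functions that each read one stream and write the next. The sketch below is only an illustration of that programming model in plain C; the kernel_fn type, the two toy kernels, and run_pipeline are hypothetical names and do not reflect the Imagine or SWAP toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A kernel consumes one stream and produces another of the same length. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Two toy kernels standing in for real ones (e.g. matched filter, detector). */
void kernel_scale2(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = 2 * in[i];
}

void kernel_add1(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = in[i] + 1;
}

/* Run the kernels in order, ping-ponging between two stream buffers.
 * Returns the buffer that holds the final output stream. */
int *run_pipeline(const kernel_fn *kernels, size_t nk,
                  const int *input, int *buf_a, int *buf_b, size_t n)
{
    assert(buf_a != buf_b);
    memcpy(buf_a, input, n * sizeof *input);
    int *src = buf_a, *dst = buf_b;
    for (size_t k = 0; k < nk; ++k) {
        kernels[k](src, dst, n);
        int *tmp = src;          /* the output stream becomes the next input */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

Every loop body inside a kernel touches only element i, which is the property that lets a stream processor spread the elements over its clusters.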

SWAPs are multi-cluster DSPs
[Figure: a DSP is one cluster of ALUs (+++***) with internal memory, exploiting ILP; a SWAP has many such clusters fed from memory, the Stream Register File (SRF), exploiting ILP within a cluster and DP across clusters]
SWAPs adapt clusters to DP
Identical clusters, same operations. Power-down unused FUs, clusters

Arithmetic clusters in SWAPs
[Figure: an arithmetic cluster. Adders, multipliers and dividers (+, *, /) with distributed register files (supports more ALUs) connect through a cross point to the SRF (from/to SRF), a scratchpad (indexed accesses), and a comm. unit on the intercluster network]

Talk Outline
➢ Research vision
➢ SWAPs - Background
➢ Algorithm design for SWAPs
➢ Architecture design for SWAPs
➢ Current and Future Research Goals

SWAPs: Physical layer algorithms
[Figure: Antenna, RF Front-end, Baseband processing (Channel estimation, Detection, Decoding), Higher Layers (MAC/Network/OS)]
Complex signal processing algorithms with GOPs of computation

SWAP mapping example: Viterbi decoding
➢ Multiple antenna systems (MIMO systems)
  ⇒ Complexity exponential with transmit x receive antennas
➢ Estimation: Linear MMSE, blind, conjugate gradient, ...
➢ Detection: FFT, (blind) interference cancellation, ...
➢ Decoding: Viterbi, Turbo, LDPC, ... and joint schemes
➢ SWAP flexibility lets you use the best algorithms for the situation
Example for concept demonstration: Viterbi decoding

Parallel Viterbi Decoding for SWAPs
➢ Add-Compare-Select (ACS): trellis interconnect: computations
  ⇒ Parallelism depends on constraint length (#states)
➢ Traceback: searching
  ⇒ Conventional
    • Sequential (No DP) with dynamic branching
    • Difficult to implement in parallel architecture
  ⇒ Use Register Exchange (RE)
    • parallel solution
[Figure: Detected bits, ACS Unit, Traceback Unit, Decoded bits]
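The ACS and register-exchange steps can be sketched as one loop over trellis states. This is an illustrative sketch rather than the talk's implementation: the constraint length K = 3, the flattened branch-metric layout bm[2*s + b], and the function name acs_re_step are all assumptions made for the example, and the state numbering assumes the new input bit enters at the top of the shift register.

```c
#include <assert.h>
#include <stdint.h>

#define K      3                /* constraint length; small value chosen for illustration */
#define NSTATE (1u << (K - 1))  /* number of trellis states */

/* One trellis step: add-compare-select plus register exchange.
 * pm_in/pm_out: path metrics per state (separate buffers).
 * sv_in/sv_out: survivor registers holding the decoded bits so far,
 *               newest bit in bit 0.
 * bm[2*s + b]:  branch metric for entering state s from its b-th predecessor. */
void acs_re_step(const uint32_t *pm_in, const uint64_t *sv_in,
                 const uint32_t *bm,
                 uint32_t *pm_out, uint64_t *sv_out)
{
    assert(pm_in != pm_out && sv_in != sv_out);  /* needs double buffering */
    for (unsigned s = 0; s < NSTATE; ++s) {
        unsigned p0 = (s << 1) & (NSTATE - 1u);  /* predecessors 2j and 2j+1 */
        unsigned p1 = p0 | 1u;
        unsigned u  = s >> (K - 2);              /* input bit that leads into s */
        uint32_t m0 = pm_in[p0] + bm[2 * s];     /* add */
        uint32_t m1 = pm_in[p1] + bm[2 * s + 1];
        unsigned win = (m1 < m0) ? p1 : p0;      /* compare */
        pm_out[s] = (m1 < m0) ? m1 : m0;         /* select */
        sv_out[s] = (sv_in[win] << 1) | u;       /* register exchange: copy the winner's history */
    }
}
```

Because each state copies its winning predecessor's bit history forward, the decoded sequence is simply read out of the best state's survivor register at the end, with no sequential traceback, and every iteration of the loop is independent of the others.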

Parallel Viterbi needs re-ordering for SWAPs
Exploiting Viterbi DP in SWAPs:
  ⇒ Use RE instead of regular traceback
  ⇒ Re-order ACS, RE
[Figure: Regular ACS vs. ACS in SWAPs. Regular ACS keeps the metrics in natural order X(0), X(1), ..., X(15); ACS in SWAPs re-orders them into a DP vector with the even-indexed entries X(0), X(2), ..., X(14) followed by the odd-indexed entries X(1), X(3), ..., X(15)]
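The re-ordering in the figure amounts to splitting the metrics into an even vector and an odd vector so that every cluster performs the same add-compare-select on aligned elements. A minimal sketch under that reading follows; the function names and the separate branch-metric arrays bm_e and bm_o are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split interleaved metrics X(0), X(1), ..., X(n-1) into an even vector
 * X(0), X(2), ... and an odd vector X(1), X(3), ... (n must be even). */
void deinterleave(const uint32_t *x, size_t n, uint32_t *even, uint32_t *odd)
{
    assert(n % 2 == 0);
    for (size_t j = 0; j < n / 2; ++j) {
        even[j] = x[2 * j];
        odd[j]  = x[2 * j + 1];
    }
}

/* Vector ACS over the re-ordered metrics: element j of every array is
 * handled identically, so each j could run on its own cluster.
 * bm_e/bm_o: branch metrics from the even/odd predecessor into state j. */
void acs_vector(const uint32_t *even, const uint32_t *odd,
                const uint32_t *bm_e, const uint32_t *bm_o,
                size_t half, uint32_t *out)
{
    for (size_t j = 0; j < half; ++j) {
        uint32_t a = even[j] + bm_e[j];   /* add */
        uint32_t b = odd[j]  + bm_o[j];
        out[j] = (b < a) ? b : a;         /* compare and select */
    }
}
```

After the split, the two predecessors of state j sit at the same index j in both vectors, so no data-dependent addressing is left inside the loop.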

Talk Outline
➢ Research vision
➢ SWAPs - Background
➢ Algorithm design for SWAPs
➢ Architecture design for SWAPs
➢ Current and Future Research Goals

SWAP architecture design
More clusters better than more ALUs per cluster (if #clusters > 2)
1. Decide how many clusters
  ⇒ Exploit DP
2. Decide what to put within each cluster
  ⇒ Maximize ILP with high functional unit efficiency
  ⇒ Search design space with “explore” tool
Time-power-area characterization
[Figure: clusters of + and * units; ILP within a cluster, DP across clusters]

Design a SWAP cluster: “Explore”
Auto-exploration of adders and multipliers for “ACS”
[Figure: 3-D plot of instruction count (40 to 160) versus #Adders (1 to 5) and #Multipliers (1 to 5); each point is labeled (Adder util%, Multiplier util%)]

“Explore” tool benefits
➢ Instruction count vs. ALU efficiency
  ⇒ What goes inside each cluster
➢ Design customized application-specific units
  ⇒ Better performance with increased ALU utilization
➢ Explore multiple algorithms
  ⇒ Turn off functional units not in use for given kernel
  ⇒ Vdd-gating, clock gating techniques

Example for SWAP architecture design
Explore Algorithm 1 : 3 adders, 3 multipliers, 32 clusters
Explore Algorithm 2 : 4 adders, 1 multiplier, 64 clusters
Explore Algorithm 3 : 2 adders, 2 multipliers, 64 clusters
Explore Algorithm 4 : 2 adders, 2 multipliers, 16 clusters
Chosen Architecture: 4 adders, 3 multipliers, 64 clusters
[Figure labels: ILP, DP]
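The chosen architecture on this slide is consistent with taking, for each resource, the largest requirement over the four explored algorithms. The talk does not state the rule explicitly, so the sketch below is an assumption; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-algorithm requirement reported by the exploration step. */
struct swap_config {
    unsigned adders;       /* adders per cluster */
    unsigned multipliers;  /* multipliers per cluster */
    unsigned clusters;     /* number of clusters */
};

static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* Smallest single architecture that covers every algorithm's needs:
 * the element-wise maximum over all requirements (assumed rule). */
struct swap_config choose_architecture(const struct swap_config *req, size_t n)
{
    assert(req != NULL && n > 0);
    struct swap_config best = req[0];
    for (size_t i = 1; i < n; ++i) {
        best.adders      = max_u(best.adders, req[i].adders);
        best.multipliers = max_u(best.multipliers, req[i].multipliers);
        best.clusters    = max_u(best.clusters, req[i].clusters);
    }
    return best;
}
```

Fed with the four requirements listed above, this returns 4 adders, 3 multipliers and 64 clusters; algorithms that need less leave the surplus ALUs and clusters available to be powered down.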

SWAP flexibility provides power savings
➢ Multiple algorithms
  ⇒ Different ALU, cluster requirements
➢ Turning off ALUs (–add –mul compiler options)
  ⇒ Use the right #ALUs from “explore” tool
➢ Turning off clusters
  ⇒ Data across SRF of all clusters
  ⇒ Cluster only has access to its own SRF
  ⇒ Next kernel may need data from SRF of other clusters
  ⇒ Reconfiguration support needs to be provided

SWAPs provide cluster reconfiguration
[Figure: the SRF connects to the clusters through a mux-demux network (MDX1, MDX2) with stream buffers and latches]
Additional latency (few cycles) due to microcontroller stalls
  - Minimal loss in performance

Cluster reconfiguration for Viterbi
Packet 1: Constraint length 7 (16 clusters)
Packet 2: Constraint length 9 (64 clusters)
Packet 3: Constraint length 5 (4 clusters)
[Figure label: DP] Clusters not in use can be turned OFF
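The three packet examples are consistent with four trellis states per cluster (2^(K-1) states divided by 4). That ratio is inferred from the slide's numbers, not stated in the talk, and the function name below is hypothetical; a minimal sketch under that assumption:

```c
#include <assert.h>

#define MAX_CLUSTERS       64u  /* 64-cluster SWAP, as in the talk */
#define STATES_PER_CLUSTER 4u   /* assumption inferred from the slide's numbers */

/* Number of clusters kept active for a Viterbi decoder of constraint length k. */
unsigned active_clusters(unsigned k)
{
    assert(k >= 3 && k <= 31);
    unsigned states = 1u << (k - 1);              /* trellis states */
    unsigned wanted = states / STATES_PER_CLUSTER;
    return wanted > MAX_CLUSTERS ? MAX_CLUSTERS : wanted;
}
```

With this rule, K = 7, 9 and 5 map to 16, 64 and 4 active clusters, and the remaining clusters are candidates for power-down between packets.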

SWAPs provide flexibility at negligible overhead
[Figure: execution time (cycles) for 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5); rows for Clusters and Memory, with labels Kernels (Computation), No Data, Memory accesses]

SWAP exploration for Viterbi decoding
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) versus number of clusters (1 to 100) for K = 9, K = 7 and K = 5; it compares different SWAPs (without reconfiguration) with the same SWAP (with reconfiguration) and marks the DSP and Max DP points]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

SWAPs: Salient features
➢ 1-2 orders of magnitude better than a DSP
➢ Any constraint length ⇒ 10 MHz at 128 Kbps
➢ Same code for all constraint lengths
  ⇒ no need to re-compile or load another code
  ⇒ as long as parallelism/cluster ratio is constant
➢ Power savings due to dynamic cluster scaling

Expected SWAP power consumption
➢ Power model based on [Khailany’03]
➢ 64 clusters and 1 multiplier per cluster:
  ⇒ 0.13 micron, 1.2 V
  ⇒ Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW)
  ⇒ Area: ~53.7 mm²
➢ 10 MHz, 128 Kbps with reconfiguration
Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.
[Figure: power (in mW, 0 to 90) versus active clusters (0 to 70, max 64)]
Viterbi       Clusters Used   Peak Power
K = 9         64              ~90 mW
K = 7         16              ~28.57 mW
K = 5         4               ~13.8 mW
overhead      0               ~8.1 mW
DSP, K = 9    1               ~200 mW

Multiuser Estimation-Detection+Decoding
Real-time target: 128 Kbps per user
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 10 to 100000) versus number of clusters (1 to 100) for FAST, MEDIUM and SLOW fading scenarios; the 32-user base-station, the mobile and the DSP are marked]
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Expected SWAP power: base-station
➢ 32 user base-station with 3 X’s (multipliers) per cluster and 64 clusters:
  ⇒ 0.13 micron, 1.2 V
  ⇒ Peak Active Power: ~18.19 mW for 1 MHz (increased X)
  ⇒ Area: ~93.4 mm²
➢ Total Peak Base-station power consumption:
  ⇒ ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
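The power figures in the talk appear to scale the per-MHz peak power linearly with clock frequency at a fixed supply voltage. The sketch below encodes only that assumed linear relation; the function name is hypothetical, and it does not model the per-cluster breakdown or the reconfiguration overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Peak power in microwatts at clock_mhz, assuming power grows linearly
 * with clock frequency from a per-MHz figure (supply voltage held fixed). */
uint64_t peak_power_uw(uint32_t uw_per_mhz, uint32_t clock_mhz)
{
    assert(clock_mhz > 0);
    return (uint64_t)uw_per_mhz * clock_mhz;
}
```

Under this assumption the quoted numbers line up: ~9 mW/MHz at 10 MHz gives ~90 mW for the 64-cluster Viterbi SWAP, ~18.19 mW/MHz at 1 GHz gives ~18.19 W for the base-station, and ~1 mW/MHz at ~200 MHz gives ~200 mW for the DSP.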

Talk Outline
➢ Research vision
➢ SWAPs - Background
➢ Algorithm design for SWAPs
➢ Architecture design for SWAPs
➢ Current and Future Research Goals

Current research: Flexibility vs. performance
SWAPs: 128 Kbps at ~10-100 mW for Viterbi
  ⇒ Borrow DP from ASICs!
➢ suitable for base-stations
  ⇒ Flexibility more important than power
➢ suitable for mobile devices
  ⇒ Power constraints tighter
  ⇒ can be customized for further power savings
Handset SWAPs (H-SWAPs)
  ⇒ Borrow Task pipelining from ASICs!
  ⇒ Application-specific units and specialized comm. network

Handset SWAPs: H-SWAPs
➢ Trade Data Parallelism for Task Pipelining
[Figure: SWAPs (max. clusters and reconfigure) use many identical clusters (+++***) on one SRF for full DP; a SWAPlet limits the number of clusters (limited DP); H-SWAPs are a collection of customized SWAPlets, each with limited DP and its own ALU mix (+++*, ++*, ++++)]

Sample points in architecture exploration
DSPs (1 cluster): ILP, Subword
SWAPs (multiple clusters): ILP, Subword, DP
H-SWAPs (optimized for handsets): ILP, Subword, DP, Task Pipelining, Custom ALUs
Programmable solutions with increased customization
Performance, Power benefits (with decreasing flexibility)

Future: Efficient algorithms and mapping
[Figure: receiver block diagram with Multipath Channel, Channel Estimator, Equalizer, Turbo Equalizer, MRC, Detector, Demodulator, Decoder, Beam-forming, Coherent STC and Non-Coherent STC]
Multiple antenna systems with 1-2 orders-of-magnitude higher complexity

Future research: Architectures
Generalized and structured framework and tools
  ⇒ Joint algorithm-architecture exploration
  ⇒ Area-time-power-flexibility tradeoffs
Potential applications: embedded systems
➢ Image and Video processing:
  ⇒ Cameras: variety of compression algorithms
➢ Biomedical applications:
  ⇒ Hearing aids: DSP running on body heat*
➢ Sensor networks
  ⇒ Compression of data before transmission
*Quote: Gene Frantz, TI Fellow

SWAPs: Flexibility, Performance, Power
➢ Need flexibility in future wireless devices
  ⇒ Algorithms and Architectures
➢ Rapid Exploration for Scalable, Wireless Application-specific Processors
  ⇒ Structured approach with flexibility-performance trade-offs
➢ SWAPs - flexibility, high performance and low power
  ⇒ Exploit data parallelism like ASICs
  ⇒ 1-2 orders better performance than DSPs
  ⇒ Turn off unused clusters and unused ALUs for low power