Algorithms and Architectures for Future Wireless Base-Stations
Sridhar Rajagopal and Joseph Cavallaro
ECE Department, Rice University
April 19, 2000
This work is supported by Texas Instruments, Nokia, Texas Advanced Technology Program and NSF
4/19/00 TI Meeting 2
Overview
Future Base-Stations
Current DSP Implementation
Our Approach
– Make Algorithms Computationally Effective
– Task Partitioning for Pipelining, Parallelism
Processor Design for Accelerating Wireless
Evolution of Wireless Communications
First Generation
Voice
Second/Current Generation
Voice + Low-rate Data (9.6 Kbps)
Third Generation
Voice + High-rate Data (2 Mbps) + Multimedia
W-CDMA
Communication System Uplink
[Diagram: signals from User 1 and User 2 reach the Base Station over a direct path and reflected paths, corrupted by noise + MAI.]
Main Processing Blocks
Channel Estimation Detection Decoding
Baseband Layer of Base-Station Receiver
Proposed Base-Station (No Multiuser Detection)
TI's Wireless Basestation (http://www.ti.com/sc/docs/psheets/diagrams/basestat.htm)
Real-Time Requirements
Multiple Data Rates by Varying Spreading Factors
Detection needs to be done in real-time
– 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps

Spreading Factor | Number of Bits / Frame | Data Rate Requirement
4                | 10240                  | 1024 Kbps
32               | 1280                   | 128 Kbps
256              | 160                    | 16 Kbps
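The cycle budget quoted above is just the clock rate divided by the target bit rate; a quick check of the slide's figure (250 MHz C6x, 128 Kbps target):

```python
# Cycle budget per detected bit on a 250 MHz C6x at the 128 Kbps target.
# The division is the whole calculation; figures are from the slide.
clock_hz = 250_000_000
data_rate_bps = 128_000
cycles_per_bit = clock_hz // data_rate_bps
print(cycles_per_bit)  # -> 1953
```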
Current DSP Implementation
[Figure: data rates achieved vs. number of users (9-15) for the multiuser detector and the matched filter, on a C67 at 166 MHz and projected (8x) for a C64. Targeted data rate = 128 Kbps.]
Complexity
Algorithm Choice Limited by Complexity
– Multistage detection reduces the data rate by half
Main Features
– Matrix-based operations
– High levels of parallelism
– Bit-level computations
32x32 problem size shown for the detector
Estimation and decoding assumed pipelined
Reasons
Sophisticated, Compute-Intensive Algorithms
Need more MIPS/FLOPS performance
Unable to fully exploit pipelining or parallelism
Bit-level computations / storage
Our Approach
Make algorithms computationally effective
– without sacrificing error-rate performance
Task Partitioning on Multiple Processing Elements
– DSPs: Core Operations
– FPGAs: Application-Specific / Bit-level Computations
Processor with reconfigurable support and extensions for wireless
Algorithms
Channel Estimation
– Avoid inversion by an iterative scheme
Detection
– Avoid block-based detection by pipelining
Computations Involved
Model
  r_i = A_i b_i
– b_i: the 2 bits of each of the K async. users aligned at times i and i-1 (2K bits total; successive bits b_i, b_{i+1} overlap in time because of the delays)
– r_i: received vector of spreading length N for the K users

Compute Correlation Matrices
  R_br = (1/L) sum_{i=1..L} b_i r_i^H
  R_bb = (1/L) sum_{i=1..L} b_i b_i^T
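A small numerical sketch of these correlation computations (NumPy, illustrative sizes; the noiseless model r_i = A b_i is assumed so that the channel recovered from the correlation matrices is exact):

```python
import numpy as np

# Sketch of the correlation-matrix computations. Sizes are illustrative,
# and noise is omitted so the recovered channel matches exactly.
rng = np.random.default_rng(0)
K, N, L = 4, 16, 150                      # users, spreading length, preamble
b = rng.choice([-1.0, 1.0], size=(L, 2 * K))          # 2K bits per window
A_true = (rng.standard_normal((N, 2 * K))
          + 1j * rng.standard_normal((N, 2 * K)))
r = b @ A_true.T                          # r_i = A b_i, stacked row-wise

R_br = (b.T @ r.conj()) / L               # (1/L) sum_i b_i r_i^H
R_bb = (b.T @ b) / L                      # (1/L) sum_i b_i b_i^T

# The channel estimate solves R_bb A^H = R_br (direct solve, for checking):
A_hat = np.linalg.solve(R_bb, R_br).conj().T
print(np.allclose(A_hat, A_true))         # -> True
```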
Multishot Detection
Detect D bits jointly by stacking the windows:
  r = A b,  with A in C^{ND x KD}
where A is block-banded, built from the blocks of A_i = [A_0 A_1] in C^{N x 2K}

Solve for the channel estimate A_i from
  R_bb A^H = R_br
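The block-banded stacking can be sketched as follows (illustrative sizes; the placement of A_0 on the current bits and A_1 on the previous bits is an assumption consistent with r_i = A_0 b_i + A_1 b_{i-1}):

```python
import numpy as np

# Sketch of multishot stacking: each window sees r_i = A0 b_i + A1 b_(i-1),
# so stacking D windows gives r = A b with a block-banded ND x KD matrix.
rng = np.random.default_rng(1)
K, N, D = 2, 4, 3
A0 = rng.standard_normal((N, K))
A1 = rng.standard_normal((N, K))

A = np.zeros((N * D, K * D))
for i in range(D):
    A[i * N:(i + 1) * N, i * K:(i + 1) * K] = A0       # current bits
    if i > 0:
        A[i * N:(i + 1) * N, (i - 1) * K:i * K] = A1   # previous bits
print(A.shape)  # -> (12, 6)
```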
Differencing Multistage Detection
Stage 0 (Matched Filter):
  y^0 = A^H r
  d^0 = sign(Re[y^0])

Stage 1:
  y^1 = y^0 + (S - A^H A) d^0
  d^1 = sign(Re[y^1])

Successive Stages:
  x^l = d^l - d^{l-1}
  y^{l+1} = y^l + (S - A^H A) x^l
  d^{l+1} = sign(Re[y^{l+1}])

S = diag(A^H A)
y - soft decision
d - detected bits (hard decision)
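The stages can be traced on a tiny hand-checkable example (2 users, real-valued, noise omitted; the matrix values are illustrative, not from the slides):

```python
import numpy as np

# Sketch of the differencing multistage detector on a 2-user toy example.
A = np.array([[2.0, 1.0],
              [0.0, 2.0]])           # illustrative spreading/amplitude matrix
d_true = np.array([1.0, -1.0])       # transmitted bits
r = A @ d_true                       # received vector (noise omitted)

G = A.T @ A                          # A^H A (real-valued here)
S = np.diag(np.diag(G))              # S = diag(A^H A)

y = A.T @ r                          # stage 0: matched filter, y0 = A^H r
d_prev = np.sign(y)                  # d0 = sign(Re[y0])
y = y + (S - G) @ d_prev             # stage 1: y1 = y0 + (S - A^H A) d0
d = np.sign(y)

for _ in range(3):                   # successive stages
    x = d - d_prev                   # differencing: x = d^l - d^(l-1)
    y = y + (S - G) @ x              # y^(l+1) = y^l + (S - A^H A) x^l
    d_prev, d = d, np.sign(y)

print(d)  # -> [ 1. -1.], the transmitted bits
```

Once the hard decisions stop changing, x is all zeros and the remaining stages cost nothing — that is the point of the differencing formulation.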
Iterative Scheme
Tracking
Method of Steepest Descent
Stable convergence behavior
Same Performance
Sliding-window updates of the correlation matrices:
  R_bb(new) = R_bb(old) + b_L b_L^T - b_0 b_0^T
  R_br(new) = R_br(old) + b_L r_L^H - b_0 r_0^H
Solve R_bb A^H = R_br by steepest descent:
  (A^H) <- (A^H) + mu * (R_br - R_bb (A^H))
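The steepest-descent iteration that avoids inverting R_bb can be sketched numerically (illustrative sizes, noiseless model; the step-size choice from the largest eigenvalue is one standard way to get the stable convergence the slide mentions):

```python
import numpy as np

# Sketch of the iterative channel estimate: solve R_bb A^H = R_br by
# gradient steps instead of inversion. Sizes are illustrative.
rng = np.random.default_rng(2)
K2, N, L = 8, 16, 200                 # 2K stacked bits, spreading, preamble
b = rng.choice([-1.0, 1.0], size=(L, K2))
A_true = rng.standard_normal((N, K2))
r = b @ A_true.T                      # r_i = A b_i (noise omitted)

R_bb = (b.T @ b) / L                  # (1/L) sum b_i b_i^T
R_br = (b.T @ r) / L                  # (1/L) sum b_i r_i^H (real case)

mu = 1.0 / np.linalg.eigvalsh(R_bb).max()   # step size for stability
X = np.zeros_like(R_br)               # X approximates A^H
for _ in range(500):
    X = X + mu * (R_br - R_bb @ X)    # steepest-descent update

print(np.allclose(X.T, A_true))       # -> True
```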
Simulations - AWGN Channel
Detection Window = 12, SINR = 0, Paths = 3, Preamble L = 150, Spreading N = 31, Users K = 15, 10000 bits/user
MF - Matched Filter; ML - Maximum Likelihood; Act - using inversion

[Figure: comparison of bit error rates (BER, 10^-3 to 10^-1) vs. signal-to-noise ratio (SNR, 4-12 dB) for MF, ActMF, ML, ActML; complexities O(K^2 N) and O(K^3 + K^2 N) annotated.]
Fading Channel with Tracking
[Figure: BER (10^-3 to 10^0) vs. SNR (4-12 dB) for MF and ML, static vs. tracking. Doppler = 10 Hz, 1000 bits, 15 users, 3 paths.]
Block Based Detector
[Diagram: bits 2-11 pass through the matched filter and stages 1-3 as one block, then bits 12-21 as the next block; each block must complete all stages before the next begins.]
Pipelined Detector
[Diagram: bits 1-12 pass through the matched filter and stages 1-3 in pipelined fashion, each stage working on a different bit at the same time.]
Task Decomposition [Asilomar99]
Block I - Matrix Products:
  A^H r, O(KND); A_0^H A_0, A_0^H A_1, A_1^H A_1, each O(K^2 N)
Block II - Correlation Matrices (per bit):
  R_br[R], R_br[I], O(KN) each; R_bb, O(K^2)
Block III - Channel Estimation:
  solve R_bb A^H = R_br[R] and R_bb A^H = R_br[I], O(K^2 N) each
Block IV - Multistage Detection (per window):
  O(D K^2 Me)
Pilot and data bits are multiplexed into the estimation blocks; the detected bits d are produced by Block IV.
Blocks: Channel Estimation -> Matched Filter -> Multistage Detector
Achieved Data Rates
[Figure: data rates achieved vs. number of users (9-15) for different levels of pipelining and parallelism of tasks A and B: Sequential A + B; A, B; (Parallel A), B; (Parallel A), (Pipe B); (Parallel A), (Parallel + Pipe B). Data rate requirement = 128 Kbps.]
VLSI Implementation
Channel Estimation as a Case Study
Area-Time Efficient Architecture
Real-Time Implementation
Bit-Level Computations - FPGAs
Core Operations - DSPs
Motivation for Architecture
Wireless, the next wave after Multimedia
Highly Compute-Intensive Algorithms
Real-Time Requirements
Outline
Processor Core with Reconfigurable Support
Permutation Based Interleaved Memory
Processor Architecture - EPIC
Instruction Set Extensions
Truncated Multipliers
Software Support Needed
Characteristics of Wireless Algorithms
Massive Parallelism
Bit-level Computations
Matrix Based Operations
Memory Intensive
Complex-valued Data
Approximate Computations
What’s wrong with Current Architectures for these applications?
Problems with Current Architectures
UltraSPARC, C6x, MMX, IA-64
Not enough MIPS/FLOPS
Unable to fully exploit parallelism
Bit Level Computations
Memory Bottlenecks
Specialized Instructions for Wireless Communications
Why Reconfigurable
Adapt algorithms to environment
Seamless and Continuous Data Processing during Handoffs
Home Area Wireless LAN
High Speed Office Wireless LAN
Outdoor CDMA Cellular Network
Reconfigurable Support
OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network
OSI Layer 2: Data Link Layer (converts frames to bits)
OSI Layer 1: Physical Layer (hardware; raw bit stream)
Different Protocols
[Diagram: transmit chain of Source Coding and Channel Coding; receive chain of Channel Estimation, Multiuser Detection, Channel Decoding, and Source Decoding.]
MPEG-4, H.723 - Voice, Multimedia
Convolutional, Turbo - Channel Coding
A New Architecture
[Diagram: a processor core (GPP/DSP) with cache and prefetch queues (Q) connects through a crossbar to main memory and to reconfigurable logic; the reconfigurable logic takes the real-time I/O bit stream from the RF unit. The processor sits on an add-on PCMCIA card.]
Why Reconfigurable
Process initial bit-level computations
Optimize for fast I/O transfer
[Diagram: the reconfigurable logic sits between the RF unit's real-time bit stream and the rest of the processor.]
Reconfigurable Support
Based on the Garp architecture at UC Berkeley:
– Configuration caches
– 2 64-bit data buses, 1 64-bit address bus
– Control blocks, sequencer
– Boolean values, 64-bit datapath, fast I/O
Reconfigurable Support
Wide Path to Memory
– Data Transfer
– Minimize Load Times
Configuration Caches
– Recently Displaced Configurations (5 cycles)
– Can hold 4 full size Configurations
Independent Execution
Reconfigurable Support
Access to same Memory System as Processor
– Minimize overhead
When idle
– Load Configurations
– Transfer Data
Memory Interface
Access to Main Memory and L1 Data Cache
– Large, fast memory store
Memory Prefetch Queues for Sequential Accesses
– Read-aheads and write-behinds
[Diagram: processor core (GPP/DSP) with instruction cache and L1 data cache; prefetch queues (Q) and a crossbar connect the FPGA to the L1 data cache and main memory.]
Permutation Based Interleaved Memory (PBI)
High Memory Bandwidth Needed
Stride-Insensitive Memory System for Matrices
Multiple Banks
Sustained Peak Throughput (95%)
[Diagram: the interleaved banks sit between the L1 data cache and main memory.]
Processor Core
64-bit EPIC Architecture with Extensions (IA-64/C6x)
Statically determined parallelism; exploit ILP
Execution time predictability
[Diagram: processor core (GPP/DSP) with cache, prefetch queues (Q), crossbar, and FPGA.]
EPIC Principle
Explicitly Parallel Instruction Computing
Evolution of VLIW Computing
Compiler- Key role
Architecture to assist Compiler
Better cope with the dynamic factors that limited VLIW parallelism
Instruction Set Extensions
To accelerate bit-level computations in wireless:
Real/Complex Integer-Bit Multiplications
– Used in multiuser detection, decoding
Bit-Bit Multiplications
– Used in outer-product updates
– Correlation, channel estimation
Complex Integer-Integer Multiplications
Useful in other signal processing applications
– Speech, video, ...
Architecture Support
Support via Instruction Set Extensions
Minimal ALU Modifications necessary
Transparent to Register Files/Memory
Additional 8-bit Special Purpose Registers
Integer-Bit Multiplications
D = D + b*C (e.g., cross-correlation)
[Diagram: 64-bit registers A and C feed a row of +/- units steered by the 8-bit register b, accumulating into the 64-bit register D. Register renaming?]
8-bit to 64-bit Conversions
D = D + b*b^T (e.g., auto-correlation)
[Diagram: the 8-bit register b is expanded into 64-bit registers two ways: b1 = b(1:8), b(1:8), ..., b(1:8) replicates the byte eight times; b2 = b(1)b(1)...b(1), ..., b(8)b(8)...b(8) replicates each bit eight times.]
Bit-Bit Multiplications
D = D + b*b^T (e.g., auto-correlation)
64-bit Register A = b1; 64-bit Register B = b2
Ex-NOR: 64-bit Register C = b1*b2 (bit-bit multiplications)

b1  b2  b1*b2
0   0   1
0   1   0
1   0   0
1   1   1
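The truth table is exactly Ex-NOR: encoding +1 as 1 and -1 as 0, the product of two antipodal bits is the XNOR of their encodings. A quick sketch, including a 64-wide variant in the spirit of the slide's 64-bit register:

```python
# Bit-bit multiplication as Ex-NOR: +1 encoded as 1, -1 encoded as 0.
def bit_mul(b1: int, b2: int) -> int:
    return 1 - (b1 ^ b2)             # 1 exactly when the signs agree

# 64 bit-bit multiplications at once, as a 64-bit Ex-NOR unit would do:
MASK64 = (1 << 64) - 1
def bit_mul64(x: int, y: int) -> int:
    return ~(x ^ y) & MASK64

for b1 in (0, 1):
    for b2 in (0, 1):
        print(b1, b2, bit_mul(b1, b2))   # reproduces the truth table
```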
Increment/Decrement
D = D + b*b^T (e.g., auto-correlation)
[Diagram: the 8-bit register b1*b2 steers a row of +/- units that increment or decrement the 64-bit register D, producing D + b1*b2.]
Complex-valued Data Processing
Is it easy to add?
Is this worth additional ALU support?
Typically supported by software!
Truncated Multipliers
Many applications need only approximate computations
Adaptive algorithms: Y = Y + mu*(Y*C)
Truncate the lower bits
Truncated multipliers - half the area / half the delay
Can do 2 truncated multiplies in parallel with a regular multiplier
[Diagram: ALU multipliers - Multiplier 1, Multiplier 2, and a Truncated Multiplier.]
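A behavioural sketch of the idea (not the slide's exact circuit): an n x n truncated multiplier only forms the partial products that land in the upper n result bits, which is where the area and delay savings come from, at the cost of a small error.

```python
# Behavioural model of an n x n truncated multiply: partial-product
# columns below bit n are never formed, introducing a small error.
def truncated_mul(a: int, b: int, n: int = 16) -> int:
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            acc += (a << i) >> n     # keep only columns at weight >= n
    return acc

a, b = 40000, 50000
exact = (a * b) >> 16                # full multiply, then drop low bits
approx = truncated_mul(a, b)
print(exact - approx)                # small non-negative error (< 16)
```

Each omitted column contributes less than one unit at the kept precision, so the total error is bounded by the operand width — acceptable for adaptive updates like Y = Y + mu*(Y*C).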
Software Support
Greater Interaction between Compilers and Architectures
– EPIC
– Reconfigurable Logic
Compiler needs to find and exploit bit-level computations
Reconfigurable Logic Programming
Other Uses
Reconfigurable Logic
– For accelerating loops of general-purpose processors
Bit-Level Support
– For other voice, video and multimedia applications
Software Suggestions
Limited OS Support
Compiler Efficiency – No more Assembly!
Performance Analysis Tools
Code Composer Studio 1.2
Conclusions
DSPs to play major role in Future Base-Station
Search for Computationally Efficient Algorithms and Better Processor Designs to meet Real-Time Requirements
Reduced Complexity Algorithms designed
Processor Core with Reconfigurable Support developed
Extra Slides
PBI Scheme
N - address length
M = 2^n banks
2^(N-n) words in each bank
To access a word:
– n-bit bank number
– (N-n)-bit address (high-order bits)
Calculation of the n-bit Bank Number
Calculate Bank Number
Use all N address bits to get the n-bit bank number:
  Y = A X, where A is an n x N matrix of 0's and 1's
  Y = A_h X_h + A_l X_l, splitting X into (N-n, n) bits, with A_l of rank n
Each bit of Y comes from an N-bit parity circuit with log_k N levels of XOR gates (k = fan-in)
[Diagram: the N-bit address feeds n parity circuits, one per row of A; the n parity bits drive a decoder producing 2^n bank-select signals.]
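The bank-number computation can be sketched directly: one parity (XOR) tree per row of A, evaluated over GF(2). The rows of A below are illustrative, not the slide's actual matrix:

```python
# Sketch of the PBI bank-number computation: Y = A X over GF(2),
# one N-bit parity tree per row of the n x N matrix A.
N, n = 8, 3                        # address bits, bank bits (2^n = 8 banks)
A_rows = [0b10110101,              # illustrative 0/1 rows of A
          0b01101101,
          0b11010011]

def parity(x: int) -> int:
    return bin(x).count("1") & 1   # models the XOR tree

def bank_number(addr: int) -> int:
    # AND the address with row i of A, then reduce with parity
    return sum(parity(A_rows[i] & addr) << i for i in range(n))

print(bank_number(0b10011010))     # -> 6
```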
Interleaved Memory Model
[Diagram: an address source feeds input buffers in front of memory banks M(0), M(1), ..., M(M-1); output buffers and a data sequencer deliver results to the data sink.]
Aspects of EPIC
Designing the Plan of Execution (POE) at compile time
Permitting the compiler to play the statistics
– Conditional branches, memory references
Communicating the POE to the hardware
– Static scheduling
– Branch information
Architecture Features in EPIC
Static Scheduling
– MultiOp
– Non-Unit Assumed Latency (NUAL)
The Branch Problem
– Predicated execution
– Control speculation
– Predicated code motion
The Memory Problem
– Cache specifiers
– Data speculation
Operation of Reconfigurable Logic
Load Configuration
– If in configuration cache, minimal time
Copy initial data with coprocessor move instructions
Start execution
Issue a wait instruction that interlocks while the logic is active
Copy registers back at kernel completion