JR.S00 1
Lecture 14: (Re)configurable Computing
Case Studies
Prof. Jan Rabaey
Computer Science 252, Spring 2000
The contributions of Andre Dehon (Caltech) and Ravi Subramanian (Morphics) to this slide set are gratefully acknowledged
JR.S00 2
Summary of Previous Class
• Configurable computing uses “programming in space”, as opposed to the “programming in time” of traditional instruction-set computers
• Key design choices
– Computational units and their granularity
– Interconnect network
– (Re)configuration time and frequency
• Next class: Some practical examples of reconfigurable computers
JR.S00 3
Applicability of Configurable Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-processors to embedded microprocessors and DSPs
– Chameleon, Pleiades, Morphics
JR.S00 4
Stand-Alone Computational Engines: UCLA Mojave System
Template Matching for Automatic Target Recognition
I960 Board
JR.S00 5
As Programmable Interface and I/O Processor
• Logic used in place of
– ASIC environment customization
– external FPGA/PLD devices
• Example
– bus protocols
– peripherals
– sensors, actuators
• Case for:
– Always have some system adaptation to do
– varying glue logic requirements
– Modern chips have capacity to hold processor + glue logic
– Reduces part count
– Value added must now be accommodated on chip (formerly board level)
JR.S00 6
Example: Interface/Peripherals
• Triscend E5
JR.S00 7
Model: IO Processor
• Array dedicated to servicing IO channel
– sensor, LAN, WAN, peripheral
• Provides
– protocol handling
– stream computation
» compression, encryption
• Looks like IO peripheral to processor
• Case for:
– many protocols, services
– only need few at a time
– dedicate attention, offload processor
JR.S00 8
IO Processing
• Single threaded processor
– cannot continuously monitor multiple data pipes (src, sink)
– need some minimal, local control to handle events
– for performance or real-time guarantees, may need to service event rapidly
– E.g. checksum (decode) and acknowledge packet
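The "checksum and acknowledge" event mentioned above can be sketched in C. This is only an illustration of the kind of local work an IO processor could take off the host: the slide does not name a checksum, so the 16-bit one's-complement sum used here (the style of the Internet checksum) is an assumption, and `inet_checksum` and `packet_ok` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement checksum over big-endian words.
 * Assumption: the slide does not say which checksum is meant. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                       /* odd length: pad the last byte */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                  /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical event handler: returns 1 when the packet, including its
 * trailing checksum field, verifies and should be acknowledged. */
int packet_ok(const uint8_t *pkt, size_t len)
{
    return inet_checksum(pkt, len) == 0;
}
```

A packet that carries its own checksum sums to all ones, so the handler only needs to compare the result against zero before acknowledging.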
JR.S00 9
Source: National Semiconductor
NAPA 1000 Block Diagram
[Block diagram; labeled blocks:]
– RPC: Reconfigurable Pipeline Cntr
– ALP: Adaptive Logic Processor
– System Port
– TBT: ToggleBus™ Transceiver
– PMA: Pipeline Memory Array
– CR32: CompactRISC™ 32 Bit Processor
– BIU: Bus Interface Unit
– CR32 Peripheral Devices
– External Memory Interface
– SMA: Scratchpad Memory Array
– CIO: Configurable I/O
JR.S00 10
Source: National Semiconductor
NAPA 1000 as IO Processor
[Figure: the NAPA1000 connects to the system host through the System Port, to ROM & DRAM through the Memory Interface, and to application-specific sensors, actuators, or other circuits through the CIO.]
JR.S00 11
I/O Stream Processor
Philips Nexperia NX-2700: a programmable HDTV media processor
Combines TriMedia VLIW with configurable media co-processors
JR.S00 12
Model: Instruction Augmentation
• Observation: Instruction Bandwidth
– Processor can only describe a small number of basic computations in a cycle
» $I$ bits → $2^I$ operations
– This is a small fraction of the operations one could do, even in terms of $w \times w \to w$ ops
» $w\,2^{2^{2w}}$ operations
– Processor could have to issue $w\,2^{(2^{2w} - I)}$ operations just to describe some computations
– An a priori selected base set of functions could be very bad for some applications
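To get a feel for the gap, the exponents in the counts above can be evaluated for small widths. The sketch below is a worked-arithmetic aid only; `table_bits` and `issue_exponent` are hypothetical helper names, and the functions return exponents of 2 rather than the (astronomically large) counts themselves.

```c
#include <stdint.h>

/* Exponent 2^(2w) that appears in the count w * 2^(2^(2w)).
 * Valid for w <= 31 so that the shift fits in 64 bits. */
uint64_t table_bits(unsigned w)
{
    return (uint64_t)1 << (2u * w);
}

/* Exponent of the issue bound w * 2^(2^(2w) - I).
 * Returns 0 when I instruction bits already cover the whole space. */
uint64_t issue_exponent(unsigned w, unsigned I)
{
    uint64_t t = table_bits(w);
    return t > I ? t - I : 0;
}
```

For example, with w = 4 and I = 32, an instruction names one of 2^32 operations, while the slide's count of possible operations is 4 * 2^256, so the bound evaluates to 4 * 2^224.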
JR.S00 13
Instruction Augmentation
• Idea:
– provide a way to augment the processor’s instruction set
– with operations needed by a particular application
– close semantic gap / avoid mismatch
• What’s required:
– some way to fit augmented instructions into stream
– execution engine for augmented instructions
» if programmable, has own instructions
– interconnect to augmented instructions
JR.S00 14
“First” Instruction Augmentation
• PRISM
– Processor Reconfiguration through Instruction Set Metamorphosis
• PRISM-I
– 68010 (10MHz) + XC3090
– can reconfigure FPGA in one second!
– 50-75 clocks for operations
[Athanas+Silverman: Brown]
JR.S00 15
PRISM-I Results
Raw kernel speedups
JR.S00 16
PRISM
• FPGA on bus
• access as memory mapped peripheral
• explicit context management
• some software discipline for use
• …not much of an “architecture” presented to user
JR.S00 17
PRISC
• Takes next step
– what would it look like if we put it on chip?
– how do we integrate it into the processor ISA?
• Architecture:
– couple into register file as “superscalar” functional unit
– flow-through array (no state)
[Razdan+Smith: Harvard]
JR.S00 18
PRISC
• ISA Integration
– add expfu instruction
– 11 bit address space for user-defined expfu instructions
– fault on pfu instruction mismatch
» trap code to service instruction miss
– all operations complete in a single clock cycle
– easily works with processor context switch
» no state + fault on mismatch pfu instr
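The fault-on-mismatch behaviour above can be modelled in a few lines of C. This is a toy software model under stated assumptions, not the PRISC hardware or its trap handler: `pfu_t`, `pfu_register`, `config_store`, and the two example functions are hypothetical, and only the `expfu` name and the 11-bit function address come from the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define PFU_ID_BITS 11u                       /* 11-bit expfu address space */
#define PFU_ID_MASK ((1u << PFU_ID_BITS) - 1u)
#define PFU_NONE    0xFFFFu                   /* hypothetical: nothing loaded */

typedef uint32_t (*pfu_fn)(uint32_t a, uint32_t b);

typedef struct {
    uint16_t loaded_id;   /* id of the function currently in the PFU */
    pfu_fn   logic;       /* stateless, two-input, one-output function */
    unsigned faults;      /* number of mismatch traps taken */
} pfu_t;

/* Hypothetical store of configurations, indexed by expfu id. */
static pfu_fn config_store[1u << PFU_ID_BITS];

/* Two example application-specific functions (illustrative only). */
uint32_t hamming(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b, n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

uint32_t sat_add8(uint32_t a, uint32_t b)
{
    uint32_t s = (a & 0xFFu) + (b & 0xFFu);
    return s > 0xFFu ? 0xFFu : s;
}

void pfu_register(uint16_t id, pfu_fn fn)
{
    config_store[id & PFU_ID_MASK] = fn;
}

/* Model of expfu: on an id mismatch, "trap" and reload the PFU, then
 * evaluate. The PFU holds no data state, so a context switch needs
 * nothing beyond this mismatch check. */
uint32_t expfu(pfu_t *pfu, uint16_t id, uint32_t a, uint32_t b)
{
    id &= PFU_ID_MASK;
    if (pfu->loaded_id != id) {
        pfu->faults++;
        pfu->logic = config_store[id];
        pfu->loaded_id = id;
    }
    return pfu->logic ? pfu->logic(a, b) : 0;
}
```

Repeated calls with the same id run without a fault; switching ids costs one reload, which is the overhead the compiler has to amortize.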
JR.S00 19
PRISC Results
• All compiled
• working from MIPS binary
• <200 4-LUTs?
– 64x3
• 200MHz MIPS base
Razdan/Micro27
JR.S00 20
Chimaera
• Start from PRISC idea
– integrate as functional unit
– no state
– RFUOPs (like expfu)
– stall processor on instruction miss, reload
• Add
– manage multiple instructions loaded
– more than 2 inputs possible
[Hauck: Northwestern]
JR.S00 21
Chimaera Architecture
• “Live” copy of register file values feed into array
• Each row of array may compute from register values or intermediates (other rows)
• Tag on array to indicate RFUOP
Results
• Compress 1.11
• Eqntott 1.8
• Life 2.06 (160 hand parallelization)
[Hauck/FCCM97]
JR.S00 22
Instruction Augmentation
• Small arrays with limited state
– so far, for automatic compilation
» reported speedups have been small
– open
» discover less-local recodings which extract greater benefit
JR.S00 23
GARP
Identified Problems:
• Single-cycle flow-through – not most promising usage style
• Moving data through Register File to/from array
– can present a limitation
» bottleneck to achieving high computation rate
[Hauser+Wawrzynek: UCB]
JR.S00 24
GARP
• Integrate as coprocessor
– similar bandwidth to processor as FU
– own access to memory
• Support multi-cycle operation
– allow state
– cycle counter to track operation
• Fast operation selection
– cache for configurations
– dense encodings, wide path to memory
JR.S00 25
GARP
• ISA -- coprocessor operations
– issue gaconfig to make a particular configuration resident (may be active or cached)
– explicitly move data to/from array
» 2 writes, 1 read
– processor suspends during coprocessor operation
» cycle count tracks operation
– array may directly access memory
» processor and array share memory space (cache/MMU keeps consistency)
» can exploit streaming data operations
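The coprocessor protocol above (configure, move data in, let the cycle counter run, move the result out) can be mimicked by a small C model. It is a sketch under assumptions, not the GARP ISA: the function names follow the instruction names used in these slides, but their signatures, the `garp_t` structure, and the `CFG_MAC` configuration are hypothetical.

```c
#include <stdint.h>

enum { CFG_NONE = -1, CFG_MAC = 1 };  /* hypothetical configuration ids */

typedef struct {
    int      resident;  /* configuration currently resident in the array */
    uint32_t in[2];     /* array registers written by the processor */
    uint32_t acc;       /* array state, kept between operations */
    unsigned count;     /* cycle counter for the operation in flight */
    unsigned stalled;   /* processor cycles spent waiting on the array */
} garp_t;

/* Make a configuration resident; loading a new one clears array state. */
void gaconfig(garp_t *g, int cfg)
{
    if (g->resident != cfg) {
        g->resident = cfg;
        g->acc = 0;
        g->count = 0;
    }
}

/* Move a value to the array; a nonzero cycle count starts an operation. */
void mtga(garp_t *g, unsigned reg, uint32_t value, unsigned cycles)
{
    g->in[reg & 1u] = value;
    if (cycles)
        g->count = cycles;
}

/* One array clock; the modelled MAC commits on its final cycle. */
static void array_clock(garp_t *g)
{
    if (g->count == 0)
        return;
    if (--g->count == 0 && g->resident == CFG_MAC)
        g->acc += g->in[0] * g->in[1];
}

/* Move the result back; the processor waits until the counter is zero. */
uint32_t mfga(garp_t *g)
{
    while (g->count) {
        array_clock(g);
        g->stalled++;
    }
    return g->acc;
}
```

Unlike the single-cycle, stateless expfu model, the accumulator here survives across operations, which is what allows multi-cycle and pipelined kernels.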
JR.S00 26
GARP
• Processor Instructions
JR.S00 27
GARP Array
• Row oriented logic
– denser for datapath operations
• Dedicated path for
– processor/memory data
• Processor does not have to be involved in the array-memory path
JR.S00 28
GARP Results
• General results
– 10-20x on stream, feed-forward operation
– 2-3x when data-dependencies limit pipelining
[Hauser+Wawrzynek/FCCM97]
JR.S00 29
PRISC/Chimaera … GARP
• PRISC/Chimaera
– basic op is single cycle: expfu (rfuop)
– no state
– could conceivably have multiple PFUs?
– Discover parallelism => run in parallel?
– Can’t run deep pipelines
• GARP
– basic op is multicycle
» gaconfig
» mtga
» mfga
– can have state/deep pipelining
– ? Multiple arrays viable?
– Identify mtga/mfga w/ corr gaconfig?
JR.S00 30
Common Theme
• To get around instruction expression limits
– define new instruction in array
» many bits of config … broad expressibility
» many parallel operators
– give array configuration short “name” which processor can call out
» …effectively the address of the operation
But: the impact of using reconfiguration at the instruction level seems limited. Explore opportunities at larger granularity levels (basic block, task, process).
JR.S00 31
Applicability of Configurable Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-processors to embedded microprocessors and DSPs
– Chameleon, Pleiades, Morphics
JR.S00 32
Example: Chameleon Reconfigurable Co-Processor (network, communication applications)
[Block diagram; labeled blocks:]
– Reconfigurable Logic: array of 32-bit data path operators & control logic
– Data Memory
– PCI Interface
– Memory Controller
– JTAG Debugging Port
– ARC CPU
– Instruction Cache
– Bus Manager & DMA Controllers
– Local Store Memory (LSM)
– Background Configuration Plane
– Configuration bit stream
– Wide internal communications bus
– Multiple banks of I/O
JR.S00 33
Reconfigurable Processor Tools Flow
[Figure: tool flow; labeled elements:]
– Customer Application / IP (C code)
– C Compiler
– ARC Object Code
– RTL HDL
– Synthesis & Layout
– Configuration Bits
– Linker
– Chameleon Executable
– C Model Simulator
– C Debugger
– Development Board
JR.S00 34
Heterogeneous Reconfiguration
[Figure: four styles of reconfiguration, each with an example fabric:]
– Reconfigurable Logic (array of CLBs): bit-level operations, e.g. encoding
– Reconfigurable Datapaths (adder, buffer, reg0, reg1, mux): dedicated data paths, e.g. filters, AGU
– Reconfigurable Arithmetic (MAC, address generators, memories): arithmetic kernels, e.g. convolution
– Reconfigurable Control (program memory, data memory, instruction decoder & controller, datapath): RTOS, process management
JR.S00 35
Multi-granularity Reconfigurable Architecture: The Berkeley Pleiades Architecture
[Figure: a control processor and a set of satellite processors (arithmetic processors, dedicated arithmetic, configurable datapath, configurable logic) attached through network interfaces to a communication network; a configuration bus delivers the configuration to each satellite.]
• Computational kernels are “spawned” to satellite processors
• Control processor supports RTOS and reconfiguration
• Order(s) of magnitude energy reduction over traditional programmable architectures
JR.S00 36
Matching Computation and Architecture
[Figure: a convolution kernel mapped onto two address generators, two memories, and two MACs, supervised by the control processor.]
Two models of computation: communicating processes + data-flow
Two architectural models: sequential control + data-driven
JR.S00 37
Execution Model of a Data-Flow Kernel
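for (i = 1; i <= L; i++)
  for (k = i; k <= L; k++)
    phi[i][k] = phi[i-1][k-1]
              + in[NP-i] * in[NP-k]
              - in[NA-1-i] * in[NA-1-k];
[Figure: the embedded processor runs the code segments before and after the loop and exchanges start and end signals with the kernel; the loop itself is mapped onto two AddrGens, MEM: in, MEM: phi, two MPYs, and two ALUs.]
• Distributed control and memory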
Reconfigurable Kernels for W-CDMA
• Dominant kernel M(MTX) requires array of MACs and segmented memories
• Additional operations such as sqrt(x), 1/x, and Trellis decoding may be implemented using an FPGA or CORDIC satellite
JR.S00 39
Inter-Satellite Communication
• Data-driven execution
– A satellite processor is enabled only when input data is ready
• Data sources generate data of different types: scalars, vectors, matrices
• Data computing processors handle data inputs of different types
[Figure: data sources (AddrGen, Memory, embedded processor) feed data computing processors (MPY, MAC); each stream is annotated with its length (1 or n) and vectors are terminated by an end-of-vector token.]
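The data-driven rule can be illustrated with a small C model of a MAC satellite that fires only when both inputs are present and emits a scalar when the end-of-vector token arrives. The token format and all names here (`token`, `mac_satellite`, `mac_fire`, `dot_product`) are hypothetical; the slides only state the enabling rule and the existence of the end-of-vector token.

```c
#include <stddef.h>

typedef enum { TOK_DATA, TOK_END_OF_VECTOR } tok_kind;

typedef struct {
    tok_kind kind;
    int      value;   /* payload, ignored for end-of-vector tokens */
} token;

typedef struct {
    long acc;         /* running sum of products */
} mac_satellite;

/* Fire the satellite once. It is enabled only when both input tokens
 * are present (non-NULL). Returns 1 and writes *out when a scalar
 * result leaves the satellite, 0 otherwise. */
int mac_fire(mac_satellite *m, const token *a, const token *b, long *out)
{
    if (a == NULL || b == NULL)
        return 0;                       /* not enabled: an input is missing */
    if (a->kind == TOK_END_OF_VECTOR || b->kind == TOK_END_OF_VECTOR) {
        *out = m->acc;                  /* vectors in, scalar out */
        m->acc = 0;
        return 1;
    }
    m->acc += (long)a->value * b->value;
    return 0;
}

/* Two data sources stream n-element vectors followed by the token. */
long dot_product(const int *x, const int *y, size_t n)
{
    mac_satellite m = { 0 };
    token eov = { TOK_END_OF_VECTOR, 0 };
    long result = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        token a = { TOK_DATA, x[i] };
        token b = { TOK_DATA, y[i] };
        mac_fire(&m, &a, &b, &result);
    }
    mac_fire(&m, &eov, &eov, &result);
    return result;
}
```

Because the token, not a loop counter, ends the accumulation, the same satellite handles vectors of any length without being reconfigured.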
JR.S00 40
Impact of Architectural Choice
Example: 16 point Complex Radix-2 FFT (Final Stage)
[Figure: three bar charts on logarithmic axes, all normalized per stage.]

Processor      Energy/stage [nJ]   Delay/stage [s]   Energy*Delay/stage [Js*e-14]
StrongARM      1870                21u               3970
TMS320C2xx     131                 10u               137
TMS320LC54x    49                  3.8u              18.5
Pleiades       13                  570n              0.75
JR.S00 41
Adaptive Multi-User Detector for W-CDMA
Pilot Correlator Unit Using LMS
[Figure: dataflow graph of the correlator, with a Filter section and a Coefficient Update section built from AG, MEM, MUL, MAC, ACC, ADD, and SUB blocks; signals s_r, s_i, y_r, y_i, Zmf_r, Zmf_i.]
JR.S00 42
Architecture Comparison
LMS Correlator at 1.67 MSymbols Data Rate
Complexity: 300 Mmult/sec and 357 Macc/sec
Note: TMS implementation requires 36 parallel processors to meet data rate - validity questionable
16 Mmacs/mW!
JR.S00 43
Maia: Reconfigurable Baseband Processor for Wireless
• 0.25um tech: 4.5mm x 6mm
• 1.2 Million transistors
• 40 MHz at 1V
• 1 mW VCELP voice coder
• Hardware
• 1 ARM-8
• 8 SRAMs & 8 AGPs
• 2 MACs
• 2 ALUs
• 2 In-Ports and 2 Out-Ports
• 14x8 FPGA
JR.S00 44
Reconfigurable Interconnect Exploration
[Figure: three interconnect styles: Multi-Bus (N inputs, B buses, M outputs), Mesh (modules), and Hierarchical Mesh (clusters).]
JR.S00 45
Software Methodology Flow
[Figure: design flow. Algorithms → Kernel Detection → Estimation/Exploration → Partitioning → Software Compilation and Reconfig. Hardware Mapping → Interface Code Generation. Supporting elements: Power & Timing Estimation of Various Kernel Implementations, PDA Models, Premapped Kernels, C++ Module Libraries. Other labels: Accelerator, proc & Behavioral, Demonstration, C++, SUIF + C-IF.]
JR.S00 46
Hardware-Software Exploration
Macromodel call
JR.S00 47
Implementation Fabrics for Protocols
A protocol = Extended FSM
Intercom TDMA MAC
[Figure: FSM with states idle, update, write, read, slotset, and RACH, attached through buffers to a 2x16 Slot_Set_Tbl memory; signals shown: addr, slot_set<31:0>, Slot_no<5:0>, Slotstart, Pktend, RACHreq, RACHakn, W_ENA, R_ENA.]
JR.S00 48
Intercom TDMA MAC: Implementation alternatives
• ASIC: 1 V, 0.25 µm CMOS process
• FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
• ARM8: 1 V, 25 MHz processor; n = 13,000
• Ratio: 1 - 8 - >> 400

         ASIC         FPGA         ARM8
Power    0.26 mW      2.1 mW       114 mW
Energy   10.2 pJ/op   81.4 pJ/op   n*457 pJ/op
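The "1 - 8 - >> 400" ratio can be checked directly from the table. The helper below is a worked-arithmetic sketch; `impl_t`, `power_ratio`, and `energy_ratio` are hypothetical names, and the ARM8 energy entry is left out because it depends on the factor n.

```c
/* One row set of the table: power in mW, energy in pJ/op. */
typedef struct {
    double power_mw;
    double energy_pj_per_op;
} impl_t;

/* Power of x relative to a baseline implementation. */
double power_ratio(impl_t x, impl_t base)
{
    return x.power_mw / base.power_mw;
}

/* Energy per operation of x relative to a baseline implementation. */
double energy_ratio(impl_t x, impl_t base)
{
    return x.energy_pj_per_op / base.energy_pj_per_op;
}
```

With the table values, the FPGA draws about 8.1x the ASIC power (2.1 / 0.26) and about 8.0x its energy per operation (81.4 / 10.2), while the ARM8 draws about 438x the ASIC power (114 / 0.26).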
JR.S00 49
The Software-Defined Radio
[Figure: building blocks of the software-defined radio: reconfigurable datapath, FPGA, embedded uP, dedicated FSM, dedicated DSP.]
JR.S00 50
An Industrial Example: Basestation for Cellular Wireless
[Figure: 800 MHz and 1900 MHz band plans (blocks labeled A through F); multi-sector, multi-band antenna systems feed RF/IF tuners and block-spectrum A/D converters, which connect to a high-speed digital bus; T1 / E1 links on the network side.]
Modular/Parameterizable
• per carrier
• per TDMA time-slot
• per CDMA code
JR.S00 51
BTS Signal Processing Platforms
[Figure: A/D conversion and D/A conversion attach through comm agents to a high-speed digital bus; processing is split across N DSP/CPUs, a CPU, and hardwired ASICs per standard and carrier (Standard A, F1; Standard A, F2; Standard B, F1; ...).]
JR.S00 52
Basestation of the Next Generation
[Figure: wideband RF on the air side; data networks (10/100 or Gbit, ATM) on the network side.]
JR.S00 53
Coexistence of Multiple Standards In Product Supply Chain
2.5G
– GPRS
– HCSD
– IS-95 MDR
– IS-95 HDR
– IS-136 HS
3G
– ETSI UTRA
– ARIB W-CDMA
– TIA cdma2000
– W-TDMA (UWC)
2G
– GSM
– DCS1800
– PCS1900
– IS-95
– IS-54B
– IS-136
– PDC
[Figure axes: CIRCUIT to PACKET, VOICE to DATA, NARROWBAND to WIDEBAND]
JR.S00 54
Wideband CDMA: MOPS? No, GOPS!
Function                 MIPS
Digital RRC Channel      3600
Searcher                 2100
RAKE                     1050
Maximal Ratio Combiner     24
Channel Estimator          12
AGC, AFC                   10
Deinterleaver              15
Turbo Coder                90
TOTAL                    6901

Single 384 kbps ARIB W-CDMA Channel
Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE Communication Theory Workshop, May 1999, Aptos, USA.
JR.S00 55
HW Multistandard Solutions
The common approach to hardware design involves multiple ASICs to support each standard.
• Hardwired implementation is not scalable or upgradeable to new standards.
• This approach costs time in a time-to-market dominated world.
• Creating new chipsets for every technology combination critically challenges available design resources!
[Figure: a DSP and control processor alongside a separate digital hardwired ASIC, IF, and RF chain for each standard; regions labeled Analog, Programmable, and Unique Combinations.]
JR.S00 56
SW Multistandard Solution
Applying instruction-set processor architectures to all baseband processing would be desirable...
…but is simply not a good implementation for base stations:
- Unacceptably high cost per channel
- Unacceptably large power per channel
This is definitely not a viable implementation for terminals
[Figure: a DSP and control processor serving several IF/RF chains; regions labeled Analog and Programmable.]
JR.S00 57
The Law of Diminishing Returns
• More transistors are being thrown at improving general-purpose CPU and DSP performance
• Fundamental bounds are being pushed
– limits on instruction-level parallelism
– limits on memory system performance
• Returns per transistor are diminishing
– new architectures realizing only 2-3 instructions/clock
– increasingly large caches to hide DRAM latency
JR.S00 58
Embedded Logic Evolution
– Increasing fixed-function hardwired content of systems
– Core+Logic becomes de-facto design architecture
– Move to deep sub-micron technology
» rapidly increasing product integration cycles
» increasingly constrained design resources
» sharp increases in cost of “trying out” an idea (NRE)
– Design methodologies optimized for random logic, homogeneous architectures, and lower speed signal processing (i.e. control-flow dominated systems)
– Verification issues dominate design cycle time
Growing Design Cycle Times At Odds With Shrinking Product Cycle Times
JR.S00 59
FPGA the Solution?
Cellular Handset Using Current FPGA
JR.S00 60
Some Interesting Observations
Don’t use more transistors to stretch general-purpose performance, whether for CPUs, DSPs, or reconfigurable logic.
Don’t use more time to design dedicated hardwired solutions in cases where mass customization is what the market demands.
JR.S00 61
View The “Reconfigurability” Problem From The System Level
What Are the Application-Specific Performance Needs?
– What are the applications targeted?
– What algorithms are essential to achieving the performance goals?
– What are the functions at the heart of these algorithms?
– Which functions yield poor price-performance with general-purpose MOPS and system memory models?
– What is the embedded systems programmer’s model?
– Best performance at what cost:
» Area - instruction-level parallelism, memory hierarchy
» Power- energy requirements on a function basis
» Time- quality and ease of programming for app development
» Pain- forward opportunity vs backward compatibility
JR.S00 62
Successfully Using Reconfigurability
Application-Specific Leverage
Focus first on applications and constituent algorithms, not the silicon architecture!
Wireless Communications Transceiver Signal Processing
Define optimal architecture for efficient implementation
Minimize the hardware reconfigurability to a constrained set
Maximize the software parameterizability and ease of use of the programmer’s model for flexibility
JR.S00 63
Application-Specific MOPS in Digital Communications
[Figure: RF/IF feeds digital downconversion and channelization, followed by a TDMA wideband signal processing engine, a CDMA wideband signal processing engine, a wideband channel decoder engine, a programmable DSP, and a microprocessor.]
JR.S00 64
Morphics’ DRL Architecture
Heterogeneous Multiprocessing Engine Using Application-Specific Reconfigurable Logic
[Figure: data flows through a chain of kernels, each with inputs, an output, Clk, and Enable; kernels range from small granularity to large granularity.]
JR.S00 65
DRL Kernels
DATA SEQUENCER
DATA MEMORY
PARAMETERIZABLE CONFIGURABLE ALU
JR.S00 66
Mapping Software to Target Architecture
[Figure: software mapped onto two kernels, each consisting of a DATA SEQUENCER, a DATA MEMORY, and a PARAMETERIZABLE CONFIGURABLE ALU.]
JR.S00 67
Programmer’s Guide
Document provided with each processor to enable application development, configuration, and system control via host processor.
Includes complete
• description of each API function call
• system control functions
• variables, parameters
• coding examples, and performance realized (i.e. ROC curves)

/* morphics soft API usage examples: */
/* search a pilot set in search set */
/* maintenance mode. */
...
set_ptr = get_next_set();
if (no_need_to_throttle())
{
    search_set(set_ptr);
}
...
void search_set(PILOT_TYPE *pilot)
{
    /* search threshold is assumed to be set in other places */
    morphics_searcher_set_win_size(pilot->win_size);
    morphics_searcher_set_pn(pilot->pn);
    morphics_searcher_set_int_len(pilot->int_len);
}

/* finger re-assignments */
...
fing[i].pos = morphics_demod_get_fin_pos(i);
...
distance = calc_fing_movement(&fing[i], new_pn);
...
fing[i].slew = distance;
fing[i].pn = new_pn;
morphics_demod_set_fing_iq(i, fing[i].pn);
while ( slew_not_done(morphics_demod_get_status()) )
{
    morphics_demod_set_fing_slew(i, fing[i].slew);
}
...
JR.S00 68
Key Pieces of Design Methodology
System-level Profiling
• Analyze sequences of operations (arithmetic, memory access, etc)
• Analyze communication bottlenecks
• Key flexible parameters (algorithm vs. architecture parameters)
Architecture-level Profiling
• ALU/kernel definition (sequences of operators)
• Memory profile
• Type of configurability required for flexibility
• Macro-sequencer development
Implementation
• SW: programmer’s model developed at architecture specification stage
• SW: API proven out via behavioral models & demonstrator hardware
• VLSI: focus on regular predictable timing and routability
• VLSI: embedded reconfigurability in an ASIC flow
JR.S00 69
CDMA Modem Analysis
JR.S00 70
MOPS Breakdown Analysis
JR.S00 71
Prototype Demonstrator
RISC Microprocessor
Configurable Kernel(s)
Data Router
JR.S00 72
Summary
• Configurable computing is finding its way into the embedded processor space
• Best suited (so far) for
– Flexible I/O and Interface functionality
– providing task-level acceleration of “parameterizable” functions
• Improvement of IP seems limited
• Software flow still subject to improvement
• Might become more interesting with the emergence of low-current devices (TFT, organic transistors, molecular computing)
DO NOT FORGET CONFIGURATION OVERHEAD