Frank Vahid, UC Riverside

Self-Improving Configurable IC Platforms

Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
Co-PI: Walid Najjar, Professor, CS&E, UCR
Goal: Platform Self-Tunes to Executing Application
- Download a standard binary
- Platform adjusts to the executing application
- Result is better speed and energy
- Why and how?
[Figure: execution time and energy bar charts, before and after the platform tunes itself to the executing application]
Platforms
- Pre-designed programmable platforms reduce NRE cost, time-to-market, and risk
- The platform designer amortizes design cost over large volumes
- Many (if not most) will include an FPGA
- Today: Triscend, Altera, Xilinx, Atmel; more are sure to come as FPGA vendors license to SoC makers
[Figure: sample platform with processor, L1 cache, memory, FPGA, and peripherals (Periph1, JPEG)]

Sample platform: processor, cache, memory, FPGA, etc.
[Figure: cost per IC vs. volume for mainstream design in 1990, 2000, and 2010]

Modern IC costs are feasible mostly at very high volumes.
Hardware/Software Partitioning Improves Speed and Energy
[Figure: platform with processor, L1 cache, memory, FPGA, and peripherals; a hw/sw partitioner moves critical code from the processor to the FPGA]

- But this requires a partitioning CAD tool
- OK in some flows; in mainstream software flows (standard SW tools), hard to integrate

[Figure: execution time and energy bar charts before and after partitioning; timeline shows the uP active, then idle while the FPGA runs the partitioned loop]
Idea: Perform Partitioning Dynamically (and hence Transparently)
Add components on-chip to:
- Profile
- Decompile frequent loops
- Optimize
- Synthesize
- Place and route onto the FPGA
- Update the SW to call the FPGA

Transparent: no impact on the tool flow. Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies.

But how can you profile, decompile, optimize, synthesize, and place and route, all on-chip?
[Figure: platform with processor, L1 cache, memory, and FPGA, plus on-chip dynamic partitioning modules: profiler, explorer, decompiler/optimizer, and synthesis/place-and-route]
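The final step above, updating the software to call the FPGA, can be sketched in a few lines. This is a toy illustration, not the actual ARM binary updater: the instruction strings, loop indices, and the stub label `fpga_loop0` are all hypothetical.

```python
# Minimal sketch of the "update Sw to call FPGA" step, on a toy
# instruction list; real binary updating rewrites machine code in place.

def patch_binary(binary, loop_start, loop_end, fpga_stub_label):
    """Redirect a hot loop to an FPGA-invoking stub.

    binary: list of instruction strings (toy representation)
    loop_start, loop_end: half-open index range of the hot loop
    fpga_stub_label: label of the stub that starts the FPGA and waits
    """
    patched = list(binary)  # leave the original untouched
    # The first instruction of the loop becomes a branch to the stub;
    # the stub returns to the instruction after the loop.
    patched[loop_start] = f"b {fpga_stub_label}"
    # Remaining loop instructions become unreachable but stay in place,
    # so the rest of the binary keeps its original addresses.
    return patched

binary = ["mov r0, #0", "ldr r1, [r2]", "add r0, r0, r1", "bne loop", "str r0, [r3]"]
patched = patch_binary(binary, 1, 4, "fpga_loop0")
```

Keeping the patched binary the same length is what makes the update transparent: no other addresses move, so no relocation is needed.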
Dynamic Partitioning Requires Lean Tools
How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations?
- Key: our tools only need to be good enough to speed up critical loops
- Most time is spent in small loops (e.g., MediaBench, NetBench, EEMBC)
- We created ultra-lean versions of the tools
- Quality is not necessarily as good, but it is good enough
- They run on a 60 MHz ARM7
[Figure: for the ten most frequent loops, % of execution time vs. % of program size; a few small loops account for most of the execution time]
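The observation that a few small loops dominate execution time is what the on-chip profiler exploits. As a hedged illustration (not the actual profiler hardware, whose design the slides do not detail), here is the common software analogue: counting taken backward branches in an address trace, since each backward branch marks one loop iteration.

```python
from collections import Counter

def profile_backward_branches(trace):
    """Count taken backward branches in an instruction-address trace.

    A jump to a lower address marks a loop iteration, so the most
    frequent backward-branch targets identify the hottest loop heads.
    trace: sequence of instruction addresses in execution order.
    """
    counts = Counter()
    prev = None
    for addr in trace:
        if prev is not None and addr < prev:
            counts[addr] += 1  # backward jump -> loop head at addr
        prev = addr
    return counts

# Toy trace: a 3-instruction loop at 0x100 executed 4 times.
trace = [0x0FC] + [0x100, 0x104, 0x108] * 4 + [0x10C]
hot = profile_backward_branches(trace)
```

Hardware can keep such counts nonintrusively by snooping the address bus, which is why profiling need not slow the running program.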
Dynamic Hw/Sw Partitioning Tool Chain
[Figure: platform with processor, L1 cache, memory, FPGA, profiler, explorer, and partitioner]

Tool chain: Binary → Loop Profiler → small, frequent loops → Loop Decompilation → Hw Synthesis → Tech. Mapping → Place & Route → Bitfile Creation → Binary Modification (plus DMA configuration) → Updated Binary

- Architecture targeted for loop speedup, simple place and route
- We've developed efficient profiler hardware
- We're continuing to extend these tools to handle more benchmarks
Dynamic Hw/Sw Partitioning Results
[Figure: platform with processor, L1 cache, memory, FPGA, profiler, explorer, and partitioner (UCR tools)]

UCR tools:

| Tool | Code size (lines) | Memory (bytes) | Avg. time (s) | Binary size (bytes) |
| Decompilation | 4,695 | 360K | 1.60 | 47K |
| FPGA config. (RT synthesis, logic min., tech. mapping, place & route) | 7,203 | 452K | 0.20 | 67K |
Dynamic Hw/Sw Partitioning Results
| Example | Sw time | Sw loop time | Hw loop time | Sw/Hw time | Speedup |
| brev | 0.07 | 0.05 | 0.001 | 0.02 | 3.1 |
| g3fax1 | 33.84 | 10.58 | 1.19 | 24.45 | 1.4 |
| g3fax2 | 33.84 | 10.64 | 2.15 | 25.35 | 1.3 |
| url | 547.06 | 437.39 | 19.13 | 128.80 | 4.2 |
| logmin | 23.50 | 15.00 | 0.31 | 8.81 | 2.7 |
| pktflow | 1.19 | 0.42 | 0.09 | 0.86 | 1.4 |
| canrdr | 1.18 | 0.41 | 0.07 | 0.84 | 1.4 |
| bitmnp | 6.98 | 3.75 | 0.04 | 3.27 | 2.1 |
| Avg: | | 59.78 | 2.87 | 24.05 | 2.2 |
- Powerstone, NetBench, and EEMBC examples; the most frequent loop only
- Average speedup is very close to the ideal speedup of 2.4, so not much is left on the table in these examples
- Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools
- ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective
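The Sw/Hw times in the table follow an Amdahl-style decomposition: partitioned time is the non-loop software time plus the loop's hardware time. A minimal sketch, checked against rows of the table (communication overhead is ignored here):

```python
def partitioned_time(sw_time, sw_loop_time, hw_loop_time):
    """Time after moving the loop to the FPGA: the software time
    outside the loop, plus the loop's time in hardware."""
    return (sw_time - sw_loop_time) + hw_loop_time

def speedup(sw_time, sw_loop_time, hw_loop_time):
    """Overall speedup from accelerating only the one loop."""
    return sw_time / partitioned_time(sw_time, sw_loop_time, hw_loop_time)

# g3fax1 row: 33.84 s total, 10.58 s in the loop, 1.19 s in hardware.
t = partitioned_time(33.84, 10.58, 1.19)   # ~24.45 s, matching the table
s = speedup(547.06, 437.39, 19.13)         # url row, ~4.2x
```

This also shows why the average speedup is capped near 2.4: even an infinitely fast loop leaves the non-loop software time untouched.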
Configurable Cache: Why?
- ARM920T: caches consume half of total processor system power (Segars '01)
- M*CORE: unified cache consumes half of total processor system power (Lee/Moyer/Arends '99)
[Figure: platform with processor, L1 cache, memory, FPGA, and dynamic partitioning modules (profiler, explorer, decompiler, synthesis, place and route)]
Best Cache for Embedded Systems?
Diversity of associativity, line size, total size
| Processor | I-cache size | Assoc. | Line | D-cache size | Assoc. | Line |
| AMD-K6-IIIE | 32K | 2 | 32 | 32K | 2 | 32 |
| Alchemy AU1000 | 16K | 4 | 32 | 16K | 4 | 32 |
| ARM 7 | 8K/U | 4 | 16 | 8K/U | 4 | 16 |
| ColdFire | 0-32K | DM | 16 | 0-32K | N/A | N/A |
| Hitachi SH7750S (SH4) | 8K | DM | 32 | 16K | DM | 32 |
| Hitachi SH7727 | 16K/U | 4 | 16 | 16K/U | 4 | 16 |
| IBM PPC 750CX | 32K | 8 | 32 | 32K | 8 | 32 |
| IBM PPC 7603 | 16K | 4 | 32 | 16K | 4 | 32 |
| IBM 750FX | 32K | 8 | 32 | 32K | 8 | 32 |
| IBM 403GCX | 16K | 2 | 16 | 8K | 2 | 16 |
| IBM PowerPC 405CR | 16K | 2 | 32 | 8K | 2 | 32 |
| Intel 960JA | 2K | 2 | N/A | 1K | 2 | N/A |
| Intel 960JD | 4K | 2 | N/A | 2K | 2 | N/A |
| Intel 960IT | 16K | 2 | N/A | 4K | 2 | N/A |
| Motorola MPC8240 | 16K | 4 | 32 | 16K | 4 | 32 |
| Motorola MPC8540 | 32K | 4 | 32/64 | 32K | 4 | 32/64 |
| Motorola MPC7455 | 32K | 8 | 32 | 32K | 8 | 32 |
| NEC VR5500 | 32K | 2 | 32 | 32K | 2 | 32 |
| NEC VR4131 | 16K | 2 | 16/32 | 16K | 2 | 16/32 |
| NEC VR4181 | 4K | DM | 16 | 4K | DM | 16 |
| NEC VR4181A | 8K | DM | 32 | 8K | DM | 32 |
| NEC VR4121 | 16K | DM | 16 | 8K | DM | 16 |
| PMC-Sierra RM9000X2 | 16K | 4 | N/A | 16K | 4 | N/A |
| PMC-Sierra RM7000A | 16K | 4 | 32 | 16K | 4 | 32 |
| SandCraft SR71000 | 32K | 4 | 32 | 32K | 4 | 32 |
| Sun UltraSPARC IIe | 16K | 2 | N/A | 16K | DM | N/A |
| SuperH | 32K | 4 | 32 | 32K | 4 | 32 |
| TI TMS320C6414 | 16K | DM | N/A | 16K | 2 | N/A |
| TriMedia TM32A | 32K | 8 | 64 | 16K | 8 | 64 |
| Xilinx Virtex-II Pro | 16K | 2 | 32 | 8K | 2 | 32 |

(DM = direct mapped; /U = unified instruction/data cache; line sizes in bytes)
Cache Design Dilemmas

Associativity:
- Low: low power, good performance for many programs
- High: better performance on more programs

Total size:
- Small: lower power if the working set is small (and less area)
- Big: better performance/power if the working set is large

Line size:
- Small: better when spatial locality is poor
- Big: better when spatial locality is good

Most caches are a compromise over many programs: they work best on average. But embedded systems run one or a few programs, so we want the best cache for that one program.
Solution to the Cache Design Dilemma
Configurable cache: design a physical cache that can be reconfigured.
- 1, 2, or 4 ways: way concatenation, a new technique, ISCA'03 (Zhang/Vahid/Najjar); four 2K ways plus concatenation logic
- 8K, 4K, or 2K byte total size: way shutdown, ISCA'03; gates Vdd, saving both dynamic and static power, with some performance overhead (5%)
- 16, 32, or 64 byte line size: variable line fetch size, ISVLSI'03; a physical 16-byte line, with one, two, or four physical line fetches
Note: this is a single physical cache, not a synthesizable core.
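Way concatenation can be sketched as an address-splitting function: concatenating ways multiplies the number of sets, so address bits a11 (and a12) move from the tag into the set index. The sketch below assumes an 8 KB cache built from four 2 KB banks with 32-byte lines, matching the figure's a4-a0 line offset; the function name is illustrative.

```python
def cache_lookup_bits(addr, ways):
    """Split an address for a way-concatenated 8KB cache.

    ways = 4, 2, or 1 selects how the four 2KB banks are grouped:
    fewer ways means more sets, so index bits are borrowed from
    the tag (a11 in 2-way mode, a12 and a11 in 1-way mode).
    """
    assert ways in (1, 2, 4)
    offset_bits = 5                      # 32-byte line: a4..a0
    sets = (8 * 1024) // (ways * 32)     # 64, 128, or 256 sets
    index_bits = sets.bit_length() - 1   # 6, 7, or 8 index bits
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Same address, different configurations: in 4-way mode the index is
# a10..a5; in 1-way mode a12 and a11 join the index instead of the tag.
t4, i4, _ = cache_lookup_bits(0x1860, 4)
t1, i1, _ = cache_lookup_bits(0x1860, 1)
```

Because only the index/tag boundary moves, the physical arrays and the critical path are unchanged, which is why the technique costs essentially no area or speed.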
Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
[Figure: way-concatenation circuit. Configuration bits c0-c3, with address bits a11 and a12 (registers reg0, reg1), control concatenation of the 6x64 data arrays; tag part, tag address, sense amps, column mux, and mux driver shown; the critical path is marked. Address split: a31-a13 tag, a12/a11 configurable, a10-a5 index, a4-a0 line offset]

Trivial area overhead, no performance overhead.
Configurable Cache Design Metrics
We computed power, performance, energy and size using CACTI models Our own layout (0.13 TSMC CMOS), Cadence tools Energy: considered cache, memory, bus, and CPU stall
Powerstone, MediaBench, and SPEC benchmarks Used SimpleScalar for simulations
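The energy accounting described above (cache, memory, bus, and CPU stall) can be summarized in one formula. The per-event energies, cycle counts, and power values below are invented placeholders to exercise the model, not the paper's measured numbers.

```python
def cache_system_energy(accesses, misses, e_access, e_mem, e_bus,
                        miss_cycles, p_cpu_stall, t_cycle):
    """Total energy = cache access energy
                    + memory and bus energy for misses
                    + CPU energy wasted while stalled on those misses."""
    e_cache = accesses * e_access               # every access hits the cache
    e_traffic = misses * (e_mem + e_bus)        # misses go off to memory
    e_stall = misses * miss_cycles * p_cpu_stall * t_cycle
    return e_cache + e_traffic + e_stall

# Hypothetical values: 1M accesses, 2% miss rate, 0.1 nJ/access,
# 2 nJ/memory access, 1 nJ/bus transfer, 20-cycle miss, 0.2 W stall
# power, 10 ns cycle.
e_base = cache_system_energy(1_000_000, 20_000, 0.1e-9, 2.0e-9, 1.0e-9,
                             20, 0.2, 10e-9)
# A better-tuned cache with a quarter of the misses:
e_tuned = cache_system_energy(1_000_000, 5_000, 0.1e-9, 2.0e-9, 1.0e-9,
                              20, 0.2, 10e-9)
```

The model makes the point behind the results on the next slide: miss-driven terms (traffic and stall) often dominate, so a configuration that cuts misses for the one running program cuts total energy substantially.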
Configurable Cache Energy Benefits
40%-50% energy savings on average Compared to conventional 4-way and 1-way assoc., 32-byte line size AND, best for every example (remember, conventional is compromise)
[Figure: normalized energy per benchmark (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr, and average) for conventional 4-way 32B (cnv4w32), conventional 1-way 32B (cnv1w32), and the configurable cache (con4); a few conventional-cache bars exceed 100%, up to 619.6%]
Future Work
- Dynamic cache tuning
- More advanced dynamic partitioning: automatic frequent-loop detection, an on-chip exploration tool, better decompilation and synthesis, a better FPGA fabric and place and route
- Approach: continue extending the tools to support more benchmarks
- Extend to platforms with multiple processors: scales well, since processors can share the on-chip partitioning tools
Conclusions
- Self-improving configurable ICs provide excellent speed and energy improvements
- They require no modification to existing software flows, and can thus be widely adopted
- We've shown the idea is practical: lean on-chip tools are possible
- Now we need to make them even better; extensive research into algorithms, designs, and architecture is needed