Frank Vahid, UC Riverside

Self-Improving Configurable IC Platforms

Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
Co-PI: Walid Najjar, Professor, CS&E, UCR
Goal: Platform Self-Tunes to Executing Application
- Download a standard binary
- Platform adjusts to the executing application
- Result is better speed and energy
- Why and how?
[Figure: execution time and energy bar charts, before and after the platform tunes itself to the executing application]
Platforms
- Pre-designed programmable platforms reduce NRE cost, time-to-market, and risk
- The platform designer amortizes design cost over large volumes
- Many (if not most) will include an FPGA
- Today: Triscend, Altera, Xilinx, Atmel; more are sure to come as FPGA vendors license to SoC makers
[Figure: sample platform with processor, L1 cache, memory, FPGA, and peripherals (Periph1, JPEG)]

Sample platform: processor, cache, memory, FPGA, etc.
[Figure: cost per IC vs. volume for mainstream design in 1990, 2000, and 2010]

Modern IC costs are feasible mostly at very high volumes.
Hardware/Software Partitioning Improves Speed and Energy
[Figure: platform with processor, L1 cache, memory, FPGA, and peripherals; a hw/sw partitioner moves critical code from the processor to the FPGA]

- But this requires a partitioning CAD tool
- OK in some flows; in mainstream software flows (standard SW tools), hard to integrate

[Figure: execution time and energy bar charts before and after partitioning; timeline shows the uP active, then idle while the FPGA runs the partitioned loop]
Idea: Perform Partitioning Dynamically (and hence Transparently)
Add components on-chip to:
- Profile
- Decompile frequent loops
- Optimize
- Synthesize
- Place and route onto the FPGA
- Update the SW to call the FPGA

Transparent: no impact on the tool flow. Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies.

But how can you profile, decompile, optimize, synthesize, and place and route, all on-chip?
[Figure: platform with processor, L1 cache, memory, and FPGA, plus on-chip dynamic partitioning modules: profiler, explorer, decompiler/optimizer, and synthesis/place-and-route]
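The final step above, updating the software to call the FPGA, can be sketched in a few lines. This is a toy illustration, not the actual ARM binary updater: the instruction strings, loop indices, and the stub label `fpga_loop0` are all hypothetical.

```python
# Minimal sketch of the "update Sw to call FPGA" step, on a toy
# instruction list; real binary updating rewrites machine code in place.

def patch_binary(binary, loop_start, loop_end, fpga_stub_label):
    """Redirect a hot loop to an FPGA-invoking stub.

    binary: list of instruction strings (toy representation)
    loop_start, loop_end: half-open index range of the hot loop
    fpga_stub_label: label of the stub that starts the FPGA and waits
    """
    patched = list(binary)  # leave the original untouched
    # The first instruction of the loop becomes a branch to the stub;
    # the stub returns to the instruction after the loop.
    patched[loop_start] = f"b {fpga_stub_label}"
    # Remaining loop instructions become unreachable but stay in place,
    # so the rest of the binary keeps its original addresses.
    return patched

binary = ["mov r0, #0", "ldr r1, [r2]", "add r0, r0, r1", "bne loop", "str r0, [r3]"]
patched = patch_binary(binary, 1, 4, "fpga_loop0")
```

Keeping the patched binary the same length is what makes the update transparent: no other addresses move, so no relocation is needed.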
Dynamic Partitioning Requires Lean Tools
How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations?
- Key: our tools only need to be good enough to speed up critical loops
- Most time is spent in small loops (e.g., MediaBench, NetBench, EEMBC)
- We created ultra-lean versions of the tools
- Quality is not necessarily as good, but it is good enough
- They run on a 60 MHz ARM7
[Figure: for the ten most frequent loops, % of execution time vs. % of program size; a few small loops account for most of the execution time]
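The observation that a few small loops dominate execution time is what the on-chip profiler exploits. As a hedged illustration (not the actual profiler hardware, whose design the slides do not detail), here is the common software analogue: counting taken backward branches in an address trace, since each backward branch marks one loop iteration.

```python
from collections import Counter

def profile_backward_branches(trace):
    """Count taken backward branches in an instruction-address trace.

    A jump to a lower address marks a loop iteration, so the most
    frequent backward-branch targets identify the hottest loop heads.
    trace: sequence of instruction addresses in execution order.
    """
    counts = Counter()
    prev = None
    for addr in trace:
        if prev is not None and addr < prev:
            counts[addr] += 1  # backward jump -> loop head at addr
        prev = addr
    return counts

# Toy trace: a 3-instruction loop at 0x100 executed 4 times.
trace = [0x0FC] + [0x100, 0x104, 0x108] * 4 + [0x10C]
hot = profile_backward_branches(trace)
```

Hardware can keep such counts nonintrusively by snooping the address bus, which is why profiling need not slow the running program.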
Dynamic Hw/Sw Partitioning Tool Chain
[Figure: platform with processor, L1 cache, memory, FPGA, profiler, explorer, and partitioner]

Tool chain: Binary → Loop Profiler → small, frequent loops → Loop Decompilation → Hw Synthesis → Tech. Mapping → Place & Route → Bitfile Creation → Binary Modification (plus DMA configuration) → Updated Binary

- Architecture targeted for loop speedup, simple place and route
- We've developed efficient profiler hardware
- We're continuing to extend these tools to handle more benchmarks
Dynamic Hw/Sw Partitioning Results
[Figure: platform with processor, L1 cache, memory, FPGA, profiler, explorer, and partitioner (UCR tools)]

UCR tools:

| Tool | Code size (lines) | Memory (bytes) | Avg. time (s) | Binary size (bytes) |
| Decompilation | 4,695 | 360K | 1.60 | 47K |
| FPGA config. (RT synthesis, logic min., tech. mapping, place & route) | 7,203 | 452K | 0.20 | 67K |
Dynamic Hw/Sw Partitioning Results
| Example | Sw time | Sw loop time | Hw loop time | Sw/Hw time | Speedup |
| brev | 0.07 | 0.05 | 0.001 | 0.02 | 3.1 |
| g3fax1 | 33.84 | 10.58 | 1.19 | 24.45 | 1.4 |
| g3fax2 | 33.84 | 10.64 | 2.15 | 25.35 | 1.3 |
| url | 547.06 | 437.39 | 19.13 | 128.80 | 4.2 |
| logmin | 23.50 | 15.00 | 0.31 | 8.81 | 2.7 |
| pktflow | 1.19 | 0.42 | 0.09 | 0.86 | 1.4 |
| canrdr | 1.18 | 0.41 | 0.07 | 0.84 | 1.4 |
| bitmnp | 6.98 | 3.75 | 0.04 | 3.27 | 2.1 |
| Avg: | | 59.78 | 2.87 | 24.05 | 2.2 |
- Powerstone, NetBench, and EEMBC examples; the most frequent loop only
- Average speedup is very close to the ideal speedup of 2.4, so not much is left on the table in these examples
- Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools
- ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective
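The Sw/Hw times in the table follow an Amdahl-style decomposition: partitioned time is the non-loop software time plus the loop's hardware time. A minimal sketch, checked against rows of the table (communication overhead is ignored here):

```python
def partitioned_time(sw_time, sw_loop_time, hw_loop_time):
    """Time after moving the loop to the FPGA: the software time
    outside the loop, plus the loop's time in hardware."""
    return (sw_time - sw_loop_time) + hw_loop_time

def speedup(sw_time, sw_loop_time, hw_loop_time):
    """Overall speedup from accelerating only the one loop."""
    return sw_time / partitioned_time(sw_time, sw_loop_time, hw_loop_time)

# g3fax1 row: 33.84 s total, 10.58 s in the loop, 1.19 s in hardware.
t = partitioned_time(33.84, 10.58, 1.19)   # ~24.45 s, matching the table
s = speedup(547.06, 437.39, 19.13)         # url row, ~4.2x
```

This also shows why the average speedup is capped near 2.4: even an infinitely fast loop leaves the non-loop software time untouched.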
Configurable Cache: Why?
- ARM920T: caches consume half of total processor system power (Segars '01)
- M*CORE: unified cache consumes half of total processor system power (Lee/Moyer/Arends '99)
[Figure: platform with processor, L1 cache, memory, FPGA, and dynamic partitioning modules (profiler, explorer, decompiler, synthesis, place and route)]
Best Cache for Embedded Systems?
Diversity of associativity, line size, total size
| Processor | I-cache size | Assoc. | Line | D-cache size | Assoc. | Line |
| AMD-K6-IIIE | 32K | 2 | 32 | 32K | 2 | 32 |
| Alchemy AU1000 | 16K | 4 | 32 | 16K | 4 | 32 |
| ARM 7 | 8K/U | 4 | 16 | 8K/U | 4 | 16 |
| ColdFire | 0-32K | DM | 16 | 0-32K | N/A | N/A |
| Hitachi SH7750S (SH4) | 8K | DM | 32 | 16K | DM | 32 |
| Hitachi SH7727 | 16K/U | 4 | 16 | 16K/U | 4 | 16 |
| IBM PPC 750CX | 32K | 8 | 32 | 32K | 8 | 32 |
| IBM PPC 7603 | 16K | 4 | 32 | 16K | 4 | 32 |
| IBM 750FX | 32K | 8 | 32 | 32K | 8 | 32 |
| IBM 403GCX | 16K | 2 | 16 | 8K | 2 | 16 |
| IBM PowerPC 405CR | 16K | 2 | 32 | 8K | 2 | 32 |
| Intel 960JA | 2K | 2 | N/A | 1K | 2 | N/A |
| Intel 960JD | 4K | 2 | N/A | 2K | 2 | N/A |
| Intel 960IT | 16K | 2 | N/A | 4K | 2 | N/A |
| Motorola MPC8240 | 16K | 4 | 32 | 16K | 4 | 32 |
| Motorola MPC8540 | 32K | 4 | 32/64 | 32K | 4 | 32/64 |
| Motorola MPC7455 | 32K | 8 | 32 | 32K | 8 | 32 |
| NEC VR5500 | 32K | 2 | 32 | 32K | 2 | 32 |
| NEC VR4131 | 16K | 2 | 16/32 | 16K | 2 | 16/32 |
| NEC VR4181 | 4K | DM | 16 | 4K | DM | 16 |
| NEC VR4181A | 8K | DM | 32 | 8K | DM | 32 |
| NEC VR4121 | 16K | DM | 16 | 8K | DM | 16 |
| PMC-Sierra RM9000X2 | 16K | 4 | N/A | 16K | 4 | N/A |
| PMC-Sierra RM7000A | 16K | 4 | 32 | 16K | 4 | 32 |
| SandCraft SR71000 | 32K | 4 | 32 | 32K | 4 | 32 |
| Sun UltraSPARC IIe | 16K | 2 | N/A | 16K | DM | N/A |
| SuperH | 32K | 4 | 32 | 32K | 4 | 32 |
| TI TMS320C6414 | 16K | DM | N/A | 16K | 2 | N/A |
| TriMedia TM32A | 32K | 8 | 64 | 16K | 8 | 64 |
| Xilinx Virtex-II Pro | 16K | 2 | 32 | 8K | 2 | 32 |

(DM = direct mapped; /U = unified instruction/data cache; line sizes in bytes)
Cache Design Dilemmas

Associativity:
- Low: low power, good performance for many programs
- High: better performance on more programs

Total size:
- Small: lower power if the working set is small (and less area)
- Big: better performance/power if the working set is large

Line size:
- Small: better when spatial locality is poor
- Big: better when spatial locality is good

Most caches are a compromise over many programs: they work best on average. But embedded systems run one or a few programs, so we want the best cache for that one program.
Solution to the Cache Design Dilemma
Configurable cache: design a physical cache that can be reconfigured.
- 1, 2, or 4 ways: way concatenation, a new technique, ISCA'03 (Zhang/Vahid/Najjar); four 2K ways plus concatenation logic
- 8K, 4K, or 2K byte total size: way shutdown, ISCA'03; gates Vdd, saving both dynamic and static power, with some performance overhead (5%)
- 16, 32, or 64 byte line size: variable line fetch size, ISVLSI'03; a physical 16-byte line, with one, two, or four physical line fetches
Note: this is a single physical cache, not a synthesizable core.
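Way concatenation can be sketched as an address-splitting function: concatenating ways multiplies the number of sets, so address bits a11 (and a12) move from the tag into the set index. The sketch below assumes an 8 KB cache built from four 2 KB banks with 32-byte lines, matching the figure's a4-a0 line offset; the function name is illustrative.

```python
def cache_lookup_bits(addr, ways):
    """Split an address for a way-concatenated 8KB cache.

    ways = 4, 2, or 1 selects how the four 2KB banks are grouped:
    fewer ways means more sets, so index bits are borrowed from
    the tag (a11 in 2-way mode, a12 and a11 in 1-way mode).
    """
    assert ways in (1, 2, 4)
    offset_bits = 5                      # 32-byte line: a4..a0
    sets = (8 * 1024) // (ways * 32)     # 64, 128, or 256 sets
    index_bits = sets.bit_length() - 1   # 6, 7, or 8 index bits
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Same address, different configurations: in 4-way mode the index is
# a10..a5; in 1-way mode a12 and a11 join the index instead of the tag.
t4, i4, _ = cache_lookup_bits(0x1860, 4)
t1, i1, _ = cache_lookup_bits(0x1860, 1)
```

Because only the index/tag boundary moves, the physical arrays and the critical path are unchanged, which is why the technique costs essentially no area or speed.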
Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
[Figure: way-concatenation circuit. Configuration bits c0-c3, with address bits a11 and a12 (registers reg0, reg1), control concatenation of the 6x64 data arrays; tag part, tag address, sense amps, column mux, and mux driver shown; the critical path is marked. Address split: a31-a13 tag, a12/a11 configurable, a10-a5 index, a4-a0 line offset]

Trivial area overhead, no performance overhead.
Configurable Cache Design Metrics
We computed power, performance, energy and size using CACTI models Our own layout (0.13 TSMC CMOS), Cadence tools Energy: considered cache, memory, bus, and CPU stall
Powerstone, MediaBench, and SPEC benchmarks Used SimpleScalar for simulations
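The energy accounting described above (cache, memory, bus, and CPU stall) can be summarized in one formula. The per-event energies, cycle counts, and power values below are invented placeholders to exercise the model, not the paper's measured numbers.

```python
def cache_system_energy(accesses, misses, e_access, e_mem, e_bus,
                        miss_cycles, p_cpu_stall, t_cycle):
    """Total energy = cache access energy
                    + memory and bus energy for misses
                    + CPU energy wasted while stalled on those misses."""
    e_cache = accesses * e_access               # every access hits the cache
    e_traffic = misses * (e_mem + e_bus)        # misses go off to memory
    e_stall = misses * miss_cycles * p_cpu_stall * t_cycle
    return e_cache + e_traffic + e_stall

# Hypothetical values: 1M accesses, 2% miss rate, 0.1 nJ/access,
# 2 nJ/memory access, 1 nJ/bus transfer, 20-cycle miss, 0.2 W stall
# power, 10 ns cycle.
e_base = cache_system_energy(1_000_000, 20_000, 0.1e-9, 2.0e-9, 1.0e-9,
                             20, 0.2, 10e-9)
# A better-tuned cache with a quarter of the misses:
e_tuned = cache_system_energy(1_000_000, 5_000, 0.1e-9, 2.0e-9, 1.0e-9,
                              20, 0.2, 10e-9)
```

The model makes the point behind the results on the next slide: miss-driven terms (traffic and stall) often dominate, so a configuration that cuts misses for the one running program cuts total energy substantially.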
Configurable Cache Energy Benefits
40%-50% energy savings on average Compared to conventional 4-way and 1-way assoc., 32-byte line size AND, best for every example (remember, conventional is compromise)
[Figure: normalized energy per benchmark (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr, and average) for conventional 4-way 32B (cnv4w32), conventional 1-way 32B (cnv1w32), and the configurable cache (con4); a few conventional-cache bars exceed 100%, up to 619.6%]
Future Work
- Dynamic cache tuning
- More advanced dynamic partitioning: automatic frequent-loop detection, an on-chip exploration tool, better decompilation and synthesis, a better FPGA fabric and place and route
- Approach: continue extending the tools to support more benchmarks
- Extend to platforms with multiple processors: scales well, since processors can share the on-chip partitioning tools
Conclusions
- Self-improving configurable ICs provide excellent speed and energy improvements
- They require no modification to existing software flows, and can thus be widely adopted
- We've shown the idea is practical: lean on-chip tools are possible
- Now we need to make them even better; extensive research into algorithms, designs, and architecture is needed