Frank Vahid, UC Riverside 1 Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning Frank Vahid Associate Professor Dept. of Computer Science and Engineering University of California, Riverside Also with the Center for Embedded Computer Systems at UC Irvine http://www.cs.ucr.edu/~vahid


Frank Vahid, UC Riverside

1

Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning

Frank Vahid
Associate Professor

Dept. of Computer Science and Engineering
University of California, Riverside

Also with the Center for Embedded Computer Systems at UC Irvine

http://www.cs.ucr.edu/~vahid


Frank Vahid, UC Riverside 2

Trend Towards Pre-Fabricated Platforms: ASSPs

ASSP: application specific standard product
  Domain-specific pre-fabricated IC
    e.g., digital camera IC
ASIC: application specific IC
ASSP revenue > ASIC
ASSP design starts > ASIC
  Unique IC design
  Ignores quantity of same IC
ASIC design starts decreasing
  Due to strong benefits of using pre-fabricated devices

Source: Gartner/Dataquest, September '01

Frank Vahid, UC Riverside 3

Will High End ICs Still be Made?

YES
The point is that mainstream designers likely won't be making them
  Very high volume or very high cost products
Platforms are one such product – high volume
  Need to be highly configurable to adapt to different applications and constraints

[Figure: volume versus cost per IC for 1990, 2000 and 2010; labels: "Mainstream design", "Becoming out of reach of mainstream designers"]

Frank Vahid, UC Riverside 4

UCR Focus

Configurable Cache
Hardware/Software Partitioning


Frank Vahid, UC Riverside 6

Configurable Cache: Why

[Figure: IC containing a pre-fabricated platform (a pre-designed system-level architecture): uP with L1 cache, DSP, JPEG dcd, peripherals, FPGA]

ARM920T: Caches consume half of total power (Segars 01)
M*CORE: Unified cache consumes half of total power (Lee/Moyer/Arends 99)

Frank Vahid, UC Riverside 7

Best Cache for Embedded Systems?

Not clear
  Huge variety among popular embedded processors
What's the best…
  Associativity, Line size, Total size?

                        Instruct. Cache       Data Cache
Processor               Size    As.  Line     Size    As.  Line
AMD-K6-IIIE             32K     2    32       32K     2    32
Alchemy AU1000          16K     4    32       16K     4    32
ARM 7                   8K/U    4    16       8K/U    4    16
ColdFire                0-32K   DM   16       0-32K   N/A  N/A
Hitachi SH7750S (SH4)   8K      DM   32       16K     DM   32
Hitachi SH7727          16K/U   4    16       16K/U   4    16
IBM PPC 750CX           32K     8    32       32K     8    32
IBM PPC 7603            16K     4    32       16K     4    32
IBM750FX                32K     8    32       32K     8    32
IBM403GCX               16K     2    16       8K      2    16
IBM Power PC 405CR      16K     2    32       8K      2    32
Intel 960JA             2K      2    N/A      1K      2    N/A
Intel 960JD             4K      2    N/A      2K      2    N/A
Intel 960IT             16K     2    N/A      4K      2    N/A
Motorola MPC8240        16K     4    32       16K     4    32
Motorola MPC8540        32K     4    32/64    32K     4    32/64
Motorola MPC7455        32K     8    32       32K     8    32
NEC VR5500              32K     2    32       32K     2    32
NEC VR4131              16K     2    16/32    16K     2    16/32
NEC VR4181              4K      DM   16       4K      DM   16
NEC VR4181A             8K      DM   32       8K      DM   32
NEC VR4121              16      DM   16       8K      DM   16
PMC Sierra RM9000X2     16K     4    N/A      16K     4    N/A
PMC Sierra RM7000A      16K     4    32       16K     4    32
SandCraft sr71000       32K     4    32       32K     4    32
Sun Ultra SPARC Iie     16K     2    N/A      16K     DM   N/A
SuperH                  32K     4    32       32K     4    32
TI TMS320C6414          16K     DM   N/A      16K     2    N/A
TriMedia TM32A          32K     8    64       16K     8    64
Xilinx Virtex IIPro     16K     2    32       8K      2    32

(As. = associativity; DM = direct mapped; /U = unified cache)

Frank Vahid, UC Riverside 8

Cache Associativity

Direct mapped cache
  Certain bits "index" into cache
  Remaining "tag" bits compared
Set associative cache
  Multiple "ways"
  Fewer index bits, more tag bits, simultaneous comparisons
  More expensive, but better hit rate

[Figure: example addresses A, B, C, D (00 0 000, 01 0 000, 10 0 000, 11 0 000). In the direct mapped cache (1-way set associative) they share index 0000 and conflict, leaving only D (tag 11). In the 2-way set associative cache, index 000 holds both D (tag 110) and C (tag 100).]
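The index/tag split on this slide can be sketched in a few lines of Python. This is an illustration only: the 16-line cache, the 6-bit line addresses and the helper name `split_address` are assumptions chosen to resemble the figure's example, not details taken from the talk.

```python
def split_address(addr, total_lines, ways):
    """Split a line address into (tag, set_index).

    total_lines: number of cache lines in the whole cache (power of two).
    ways: associativity; more ways means fewer sets and fewer index bits.
    """
    sets = total_lines // ways
    index_bits = sets.bit_length() - 1   # log2(sets) for a power of two
    set_index = addr & (sets - 1)        # low bits select the set
    tag = addr >> index_bits             # remaining bits are compared
    return tag, set_index


# Two assumed 6-bit line addresses that differ only in the upper bits.
C, D = 0b100000, 0b110000

for ways in (1, 2):
    tag_c, idx_c = split_address(C, total_lines=16, ways=ways)
    tag_d, idx_d = split_address(D, total_lines=16, ways=ways)
    same_set = idx_c == idx_d
    # In a 1-way cache a shared set means the two lines evict each other;
    # in a 2-way cache the same set can hold both lines.
    print(f"{ways}-way: C tag={tag_c:b} D tag={tag_d:b} same set={same_set}")
```

With one way, both addresses land on the same single line and conflict; with two ways they still share a set, but the set has room for both tags.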


Frank Vahid, UC Riverside 9

Cache Associativity

Reduces miss rate – thus improving performance
Impact on power and energy?
  (Energy = Power * Time)

[Figure: miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]

Frank Vahid, UC Riverside 10

Associativity is Costly

Associativity improves hit rate, but at the cost of more power per access

Are the power savings from reduced misses outweighed by the increased power per hit?

[Figure: energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only); components: sa_data, wordline_data, bitline_data, decode_data, data output driver, mux driver, comparator, bitline_tag, sa_tag, wordline_tag, decode_tag]

[Figure: energy per access (nJ, 0.0 to 1.0) for 8 Kbyte cache versus associativity (1-way, 2-way, 4-way)]
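The question this slide poses (do fewer misses pay for costlier accesses?) can be written as a simple energy sum. The sketch below is a hedged illustration: the function name, the per-access energies, the miss rates and the miss penalty are all hypothetical values, not measurements from the talk.

```python
def total_energy_nj(accesses, miss_rate, e_access_nj, e_miss_nj):
    """Dynamic energy in nJ: every access pays the per-access cost,
    and every miss additionally pays a miss penalty (memory access
    plus CPU stall energy, lumped together here)."""
    misses = accesses * miss_rate
    return accesses * e_access_nj + misses * e_miss_nj


# Hypothetical numbers, for illustration only.
configs = {
    "1-way": {"e_access_nj": 0.4, "miss_rate": 0.020},
    "4-way": {"e_access_nj": 0.9, "miss_rate": 0.005},
}

for name, cfg in configs.items():
    energy = total_energy_nj(
        accesses=1_000_000,
        miss_rate=cfg["miss_rate"],
        e_access_nj=cfg["e_access_nj"],
        e_miss_nj=20.0,
    )
    print(f"{name}: {energy / 1e6:.2f} mJ")
```

Whether the direct mapped or the 4-way configuration wins depends entirely on the program's miss rates, which is the dilemma the following slides describe.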


Frank Vahid, UC Riverside 11

Associativity and Energy

Best performing cache is not always lowest energy

[Figure: miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]

[Figure: normalized energy (0.0 to 1.0) versus associativity (1, 2, 4) for epic and mpeg2; annotation: "Significantly poorer energy"]

Frank Vahid, UC Riverside 12

Associativity Dilemma

Direct mapped cache
  Good hit rate on most examples
  Low power per access
  But poor hit rate on some examples
    High power due to many misses

Four-way set-associative cache
  Good hit rate on nearly all examples
  But high power per access
    Overkill for most examples, thus wasting energy

Dilemma: Design for the average or worst case?


Frank Vahid, UC Riverside 13

Associativity Dilemma

Obviously not a clear choice
Previous work
  Albonesi – proposed configurable cache having way shutdown ability to save dynamic power
  Motorola M*CORE also

[Table: the processor cache table from the "Best Cache for Embedded Systems?" slide is repeated here]

[Figure: cache diagram illustrating way shutdown, with example address 11 0 000 and entry D]

Frank Vahid, UC Riverside 14

Our Solution: Way Concatenatable Cache

Can be configured as 4, 2, or 1 way
  Ways can be concatenated

[Figure: example address 11 0 000 in the way concatenatable cache, with entries D (11xx) and C (10x); annotation: "This bit selects the way"]

Frank Vahid, UC Riverside 15

Configurable Cache Design: Way Concatenation (4, 2 or 1 way)

[Figure: way concatenation circuit. Address fields: tag address (a31 to a13), a12, a11, index (a10 to a5), line offset (a4 to a0). A configuration circuit takes reg0, reg1, a11 and a12 and produces the way enable signals c0 to c3 for the 6x64 blocks of each way. Other labels: tag part, data array, bitline, column mux, sense amps, mux driver, data output, critical path.]

Small area and performance overhead
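One way to read the address layout in the figure is sketched below in Python. It assumes four 2 Kbyte banks with 32-byte lines (offset a4..a0, index a10..a5), and it assumes a particular mapping from a11/a12 to banks. The function name, the bank numbering and the choice to keep a11 and a12 in the stored tag are illustrative assumptions, not the circuit from the slide.

```python
def way_concat_lookup(addr, ways):
    """Return (set_index, enabled_banks, tag) for one cache access.

    ways = 4: all four banks are searched.
    ways = 2: a11 selects which pair of banks is searched.
    ways = 1: a12 and a11 select a single bank (direct mapped).
    """
    if ways not in (1, 2, 4):
        raise ValueError("ways must be 1, 2 or 4")

    set_index = (addr >> 5) & 0x3F   # a10..a5: one of 64 lines per bank
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1

    if ways == 4:
        enabled = (0, 1, 2, 3)
    elif ways == 2:
        enabled = (a11, a11 + 2)     # assumed pairing of banks
    else:
        enabled = (2 * a12 + a11,)   # assumed single-bank selection

    # Assumption: the stored tag always includes a11 and a12, so the
    # same tag array works in every configuration.
    tag = addr >> 11
    return set_index, enabled, tag


for ways in (4, 2, 1):
    print(ways, way_concat_lookup(0x1820, ways))
```

Because only the enabled banks are activated, the 1-way and 2-way configurations spend less energy per access while the total capacity stays at 8 Kbytes, which is the property the slides attribute to way concatenation.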


Frank Vahid, UC Riverside 16

Way Concatenate Experiments

Experiment
  Motorola PowerStone benchmark g3fax
  Considering dynamic power only
    L1 access energy, CPU stall energy, memory access energy
  Way concatenate outperforms 4 way and direct map.
    Just as good as way shutdown

[Figure: energy (nJ, 0.0000 to 0.0040) versus cache configuration]

Frank Vahid, UC Riverside 17

Way Concatenate Experiments

Considered 23 programs (Powerstone, MediaBench, and Spec2000)
Dynamic power only (L1 access energy, CPU stall energy, memory access energy)
Way concatenate
  Better than way shutdown (due to less performance penalty)
  Saves over conventional 4-way
  Also avoids big penalties of 1-way on some programs

[Figure: normalized energy per benchmark (100% = 4-way conventional cache) for CnvI1D1, cnct, shut and both; off-scale bars labeled 111%, 113%, 289%. Benchmarks: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, jpeg, mpeg2, pegwit, g721, art, mcf, parser, vpr, Average]

Frank Vahid, UC Riverside 18

Way Concatenate Experiments

Best configuration varies
  Need to tune configuration to a given program

Example     Best
padpcm      I8KD8KI1D1
crc         I4KD4KI1D1
auto2       I8KD4KI1D1
bcnt        I8KD2KI1D1
bilv        I4KD4KI1D1
binary      I8KD2KI1D1
blit        I2KD8KI1D1
brev        I8KD4KI1D2
g3fax       I4KD4KI1D1
fir         I8KD2KI1D1
pjepg       I4KD8KI1D1
pegwit      I8KD8KI1D1
ucbqsort    I4KD4KI1D1
v42         I8KD8KI1D1
adpcm       I2KD8KI1D1
epic        I8KD8KI1D1
jpeg        I8KD8KI4D2
mpeg2       I8KD8KI1D2
g721        I8KD8KI2D1
art         I4KD8KI1D1
mcf         I8KD8KI1D1
parser      I8KD8K41D1
vpr         I8KD8KI2D1
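The configuration labels in this table appear to pack four fields into one string. The Python sketch below decodes them under the assumption that `I8KD4KI1D2` means an 8 Kbyte instruction cache, a 4 Kbyte data cache, a 1-way instruction cache and a 2-way data cache. That reading, the helper name `parse_config` and the field names are assumptions, not something the slide states.

```python
import re

# Assumed label layout: I<size>K D<size>K I<ways> D<ways>
CONFIG_RE = re.compile(r"I(\d+)KD(\d+)KI(\d)D(\d)")


def parse_config(label):
    """Decode a label such as 'I8KD4KI1D2' into its four fields."""
    match = CONFIG_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    i_kb, d_kb, i_ways, d_ways = (int(g) for g in match.groups())
    return {
        "icache_kb": i_kb,
        "dcache_kb": d_kb,
        "icache_ways": i_ways,
        "dcache_ways": d_ways,
    }


# A few rows copied from the table above.
best = {"crc": "I4KD4KI1D1", "brev": "I8KD4KI1D2", "jpeg": "I8KD8KI4D2"}
for name, label in best.items():
    print(name, parse_config(label))
```

Labels that do not fit the pattern are rejected rather than guessed at, so a malformed entry would surface as an error instead of a silent misreading.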


Frank Vahid, UC Riverside 19

Normalized Execution Times

[Figure: normalized execution time per benchmark (0% to 120%) for CnvI1D1, cnct, shut and both; off-scale bars labeled 122%, 245%, 121%. Same 23 benchmarks plus Average.]

Way shutdown suffers performance penalty
  As does direct mapped
Way concatenate has almost no performance penalty
  Though 3% longer critical path than conventional 4-way

Frank Vahid, UC Riverside 20

Way Shutdown for Static Power Savings

Albonesi and Motorola used logic to gate clock
  Reduced dynamic power, but not static (leakage)
  Way concatenate clearly superior for reducing dynamic power
Shutting down ways still useful to save static power
  But we'll use another method (Agarwal DRG-cache)

[Figure: SRAM cell between Vdd and Gnd with its two bitlines and a Gated-Vdd control transistor]

Frank Vahid, UC Riverside 21

Way Concatenate Plus Way Shutdown

We set static power = 30% of dynamic power
Way shutdown now preferred in many examples
But way concatenate still very helpful

[Figure: normalized energy per benchmark (0% to 100%) for CnvI1D1, cnct, shut and both; off-scale bars labeled 114%, 268%, 116%. Same 23 benchmarks plus Average.]

Frank Vahid, UC Riverside 22

Configurable Line Size Too

Best line size also differs per example
Our cache can be configured for line of 16, 32 or 64 bytes
64 is usually best; but 16 is much better in a couple cases

[Figure: normalized energy per benchmark (100% = 4-way conventional cache) for csb16, csb32, csb64, cnv4w32 and cnv1w32, where csb is the concatenate plus shutdown cache; several off-scale bars carry labels between 119% and 230%. Benchmarks: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg]

Frank Vahid, UC Riverside 23

Configurable Cache

A configurable cache with way concatenation, way shutdown, and variable line size, can save a lot of energy

Well-suited for configurable devices like Triscend’s


Frank Vahid, UC Riverside 24

UCR Focus

Configurable Cache
Hardware/Software Partitioning

Frank Vahid, UC Riverside 25

Using On-Chip FPGA to Reduce Sw Energy

Hennessy/Patterson: "The best way to save power is to have less hardware" (pg 392)
  Actually, best way is to have less ACTIVE hw

Paradoxically, MORE hw can actually REDUCE power, as long as overall activity is reduced

How?


Frank Vahid, UC Riverside 26

Using On-Chip FPGA to Reduce Sw Energy

Move critical sw loops to FPGA
  Loop executes in 1/10th the time
  Use this time to power down the system longer during task period
  Alternatively, slow down the microprocessor using voltage scaling

[Figure: pre-fabricated platform IC (uP with L1 cache, DSP, JPEG dcd, peripherals, FPGA), and a timeline of one task period: uP active then idle, compared with uP plus FPGA active followed by a longer idle interval]
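The argument on this slide, that a shorter active phase leaves more of the task period for a low-power idle state, can be checked with a small calculation. Everything numeric below (period, active time, loop fraction, power levels) is hypothetical, and the two helper functions are illustrative names, not part of the talk.

```python
def active_time_after_offload(active_s, loop_fraction, hw_speedup=10.0):
    """Active time once the critical loops run hw_speedup times faster.

    loop_fraction is the share of the original active time spent in the
    loops that are moved to the FPGA.
    """
    return active_s * (1.0 - loop_fraction) + active_s * loop_fraction / hw_speedup


def task_energy(period_s, active_s, p_active_w, p_idle_w):
    """Energy over one task period: active phase plus idle remainder."""
    if active_s > period_s:
        raise ValueError("task does not fit in its period")
    return active_s * p_active_w + (period_s - active_s) * p_idle_w


# Hypothetical task: 0.8 s of work per 1 s period, 90% of it in loops.
before = task_energy(1.0, 0.8, p_active_w=1.0, p_idle_w=0.1)
new_active = active_time_after_offload(0.8, loop_fraction=0.9)
# Assume the FPGA raises active power slightly (1.0 W -> 1.1 W).
after = task_energy(1.0, new_active, p_active_w=1.1, p_idle_w=0.1)
print(f"before: {before:.3f} J, after: {after:.3f} J")
```

Even with higher active power in the second case, the energy drops because the system spends most of the period idle, which is the "more hardware, less activity" point made on the previous slide.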


Frank Vahid, UC Riverside 27

The 90-10 rule (or 80-20 rule)

Most software time is spent in a few small loops

e.g., MediaBench and NetBench benchmarks

Known as the 90-10 rule

10% of the code accounts for 90% of the execution time

Move those loops to FPGA

[Figure: fraction of execution time (0 to 1) contributed by each of the ten most frequent loops (1 to 10), one panel per benchmark; panel titles include g721, adpcm, pegwit, dh, md5, tl]

Frank Vahid, UC Riverside 28

Hardware/Software Partitioning Results

Example    Archit  Cycles_orig    Cycles_sw      Cycles_hw   Loop Sp.  Clk_hw  Total Sp.  P_sw  P_hw   E_orig  E_sw/hw  E Sav.
PS_g3fax   8051    19,675,456     10,812,544     176,562     61        25      2.2        0.05  0.032  0.1142  0.05408  53%
PS_crc     8051    291,196        180,224        7,168       25        25      2.5        0.05  0.028  0.0017  0.00071  58%
PS_summin  8051    109,821,892    20,394,080     384,416     53        25      1.2        0.05  0.033  0.6376  0.53657  16%
PS_brev    8051    330,064        305,768        1,360       225       25      12.9       0.05  0.034  0.0019  0.00015  92%
PS_matmul  8051    119,420        101,576        2,560       40        25      5.9        0.05  0.035  0.0007  0.00012  82%
PS_g3fax   MIPS    15,600,000     4,720,000      599,000     8         100     1.4        0.07  0.111  0.0265  0.02163  18%
PS_adpcm   MIPS    113,000        29,300         5,440       5         100     1.3        0.07  0.181  0.0002  0.00018  6%
PS_crc     MIPS    5,040,000      3,480,000      460,800     8         100     2.5        0.07  0.061  0.0086  0.00379  56%
PS_des     MIPS    142,000        70,700         15,100      5         100     1.6        0.07  0.197  0.0002  0.00019  20%
PS_engine  MIPS    915,000        145,000        28,100      5         100     1.1        0.07  0.082  0.0016  0.00146  6%
PS_jpeg    MIPS    7,900,000      646,000        171,000     4         100     1.1        0.07  0.092  0.0134  0.01360  -1%
PS_summin  MIPS    2,920,000      1,270,000      266,000     5         100     1.5        0.07  0.111  0.0050  0.00375  24%
PS_v42     MIPS    3,850,000      846,000        216,000     4         100     1.2        0.07  0.102  0.0065  0.00605  7%
PS_brev    MIPS    3,566          2,499          138         18        100     3.0        0.07  0.107  0.0000  0.00000  62%
MB_g721    MIPS    838,230,002    457,674,179    9,985,261   46        100     2.1        0.07  0.152  1.4250  0.75035  47%
MB_adpcm   MIPS    32,894,094     32,866,110     1,183,260   28        42      11.6       0.07  0.130  0.0559  0.00821  85%
MB_pegwit  MIPS    42,752,919     33,276,287     2,167,651   15        50      3.1        0.07  0.170  0.0727  0.03241  55%
NB_dh      MIPS    1,793,032,157  1,349,063,192  45,156,767  30        69      3.5        0.07  0.121  3.0482  1.00547  67%
NB_md5     MIPS    5,374,034      3,046,881      289,877     11        47      1.8        0.07  0.251  0.0091  0.00722  21%
NB_tl      MIPS    57,412,470     29,244,221     2,479,552   12        58      1.8        0.07  0.059  0.0976  0.05930  39%

Average:                                                     30                3.2                                      34%

Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)

Simulation based
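The "Loop Sp." and "Total Sp." columns can be reproduced from the three cycle-count columns for the rows where hardware and software run at the same clock. The Python sketch below does this for three of the 8051 rows; the function names are illustrative, and the formula is an Amdahl-style reading of the table inferred from the numbers rather than one quoted from the talk.

```python
def loop_speedup(cycles_sw, cycles_hw):
    """Speedup of the partitioned loops alone."""
    return cycles_sw / cycles_hw


def total_speedup(cycles_orig, cycles_sw, cycles_hw):
    """Whole-program speedup: the loop cycles are replaced by the
    hardware cycles, the rest of the program is unchanged."""
    return cycles_orig / (cycles_orig - cycles_sw + cycles_hw)


# (Cycles_orig, Cycles_sw, Cycles_hw) copied from the 8051 rows above.
rows = {
    "PS_g3fax": (19_675_456, 10_812_544, 176_562),
    "PS_crc": (291_196, 180_224, 7_168),
    "PS_brev": (330_064, 305_768, 1_360),
}

for name, (orig, sw, hw) in rows.items():
    print(f"{name}: loop {loop_speedup(sw, hw):.0f}x, "
          f"total {total_speedup(orig, sw, hw):.1f}x")
```

Note how PS_g3fax gets a 61x loop speedup but only about 2.2x overall, because the loops account for roughly half of the original cycles.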


Frank Vahid, UC Riverside 29

Analysis of Ideal Speedup

Each loop is 10x faster in hw (average based on observations)

Notice the leveling off after the first couple loops (due to 90-10 rule)

Thus, most speedup comes from the first few loops

Good for us -- Moderate amount of FPGA gives most of the speedup

How much FPGA?

[Figure: ideal speedup (1 to 10) versus number of loops moved to hardware (0 to 10), one panel each labeled g721-10%, adpcm-10%, pegwit-10%, dh-10%, md5-10%, tl-10%, url-10%, avg-10%]
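The leveling off described on this slide follows directly from Amdahl's law when each moved loop runs 10x faster. A minimal Python sketch, using made-up loop fractions (the benchmark profiles themselves are not recoverable from the slide) and an illustrative function name:

```python
from itertools import accumulate


def ideal_speedups(loop_fractions, hw_factor=10.0):
    """Speedup after moving the first k loops to hardware, for k = 0..n.

    loop_fractions: share of total execution time of each loop, most
    frequent first. Each moved loop is assumed to run hw_factor times
    faster; everything else is unchanged.
    """
    speedups = [1.0]
    for moved in accumulate(loop_fractions):
        speedups.append(1.0 / (1.0 - moved + moved / hw_factor))
    return speedups


# Hypothetical profile following the 90-10 pattern.
fractions = [0.50, 0.20, 0.10, 0.05, 0.03, 0.02]
for k, s in enumerate(ideal_speedups(fractions)):
    print(f"{k} loops in hw: {s:.2f}x")
```

With a profile like this, the first two or three loops deliver most of the gain, so a moderate amount of FPGA is enough.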


Frank Vahid, UC Riverside 30

Speedup Gained with Relatively Few Gates

Manually created several partitioned versions of each benchmark
Most speedup gained with first 20,000 gates
  Surprisingly few gates

Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

[Figure: speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721(MB), ADPCM(MB), PEGWIT(MB), DH(NB), MD5(NB), TL(NB), URL(NB); off-scale annotations: "27.2" and "2.05 at 90,000"]

Frank Vahid, UC Riverside 31

Impact of Microprocessor/FPGA Clock Ratio

Previous data assumed equal clock freq.
A faster microprocessor has significant impact
Analyzed 1:1, 2:1, 3:1, 4:1, 5:1 ratios
Planning additional such analyses
  Memory bandwidth
  Power ratios
  More

[Figure: speedup (1 to 10) versus number of loops moved to hardware (0 to 10) at microprocessor/FPGA clock ratios of 1 to 1, 2 to 1, 3 to 1, 4 to 1 and 5 to 1; panels for g721, adpcm, pegwit, dh, md5, tl]
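The effect of a slower FPGA clock can be sketched by stretching the hardware cycles by the clock ratio before applying the same speedup formula. The Python example below uses the MB_adpcm cycle counts from the results table. Treating the ratio as a simple multiplier on hardware cycles is an assumption, and so is the 100 MHz microprocessor clock used in the last call (the table lists only Clk_hw = 42 for that row).

```python
def speedup_with_clock_ratio(cycles_orig, cycles_sw, cycles_hw, clock_ratio=1.0):
    """Whole-program speedup when the microprocessor clock is
    clock_ratio times faster than the FPGA clock.

    Hardware cycles are converted to microprocessor cycles by
    multiplying with clock_ratio (assumed model).
    """
    hw_in_up_cycles = cycles_hw * clock_ratio
    return cycles_orig / (cycles_orig - cycles_sw + hw_in_up_cycles)


# MB_adpcm row: Cycles_orig, Cycles_sw, Cycles_hw.
adpcm = (32_894_094, 32_866_110, 1_183_260)

for ratio in (1, 2, 3, 4, 5):
    s = speedup_with_clock_ratio(*adpcm, clock_ratio=ratio)
    print(f"{ratio} to 1: {s:.1f}x")

# Assumed 100 MHz processor against the table's 42 MHz hardware clock.
print(f"{speedup_with_clock_ratio(*adpcm, clock_ratio=100 / 42):.1f}x")
```

Under these assumptions the equal-clock case gives about 27.2x and the 100/42 case about 11.6x, which agrees with the Total Sp. value listed for MB_adpcm.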


Frank Vahid, UC Riverside 32

Software Improvements using On-Chip Configurable Logic – Verified through Physical Measurement

Performed physical measurements on Triscend A7 and E5 devices
  Similar results (even a bit better)

A7 results

Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E sav
PS_g3fax   11.47      7.44        1.5  1.320   1.332    15.140  9.910    35%
PS_crc     10.92      4.51        2.4  1.320   1.320    14.414  5.953    59%
PS_brev    9.84       3.28        3.0  1.332   1.344    13.107  4.408    66%
Average:                          2.3                                    53%

E5 results

Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E sav
PS_g3fax   15.16      7.11        2.1  0.252   0.270    3.820   1.920    50%
PS_crc     10.64      4.64        2.3  0.207   0.225    2.202   1.044    53%
PS_brev    17.81      1.81        9.8  0.252   0.270    4.488   0.489    89%
Average:                          4.8                                    64%

[Figure: A7 IC on a Triscend A7 development board]
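The energy columns in the A7 table follow from Energy = Power * Time, and the savings column from the ratio of the two energies. A short Python check using the A7 rows (the slide gives no units, so none are assumed here; the function names are illustrative):

```python
def energy(time, power):
    """Energy = Power * Time."""
    return time * power


def energy_savings(time_orig, time_new, p_orig, p_new):
    """Fraction of energy saved by the sw/hw version."""
    return 1.0 - energy(time_new, p_new) / energy(time_orig, p_orig)


# (Time_orig, Time_sw/hw, P_orig, P_sw/hw) copied from the A7 table.
a7 = {
    "PS_g3fax": (11.47, 7.44, 1.320, 1.332),
    "PS_crc": (10.92, 4.51, 1.320, 1.320),
    "PS_brev": (9.84, 3.28, 1.332, 1.344),
}

for name, (t0, t1, p0, p1) in a7.items():
    saved = energy_savings(t0, t1, p0, p1)
    print(f"{name}: E_orig={energy(t0, p0):.3f} "
          f"E_sw/hw={energy(t1, p1):.3f} savings={saved:.0%}")
```

Power barely changes between the two versions, so nearly all of the energy savings come from the shorter execution time.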


Frank Vahid, UC Riverside 33

Other Research Directions: Tiny Caches

Impact of tiny caches on instruction fetch power
  Filter caches, dynamic loop cache, preloaded loop cache
  Gordon-Ross, Cotterell, Vahid, Comp. Arch. Letters 2002
  Gordon-Ross, Vahid, ICCD 2002.
  Cotterell, Vahid, ISSS 2002 and ICCAD 2002
  Gordon-Ross, Cotterell, Vahid, IEEE TECS, 2002

[Figure: processor fetching through a mux from either a loop cache or the L1 cache / I-mem]

[Figure: percent savings (0 to 100) per benchmark for config30 and config105. Benchmarks: adpcm, bcnt, binary, blit, compress, crc, des, engine, fir, g3fax, jpeg, summin, ucbqsort, v42, AVERAGE]

Frank Vahid, UC Riverside 34

Other Research Directions: Platform-Based CAD

Use physical platform to aid search of configuration space

Configure cache, hw/sw partition

Configure, execute, and measure

Goal: Define best cooperation between desktop CAD and platform

NSF grant 2002-2005 (with N. Dutt at UC Irvine)
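The "configure, execute, and measure" loop can be pictured as a search over the configuration space with the platform acting as the measurement oracle. The Python sketch below is only an illustration: `tune`, `fake_measure` and the parameter grid are hypothetical, the grid ignores any coupling between size and associativity, and a real flow would replace `fake_measure` with an actual run on the device.

```python
from itertools import product


def tune(measure, sizes_kb=(2, 4, 8), ways=(1, 2, 4), line_bytes=(16, 32, 64)):
    """Exhaustively try every configuration and keep the one with the
    lowest measured energy.

    measure(size_kb, n_ways, line) must configure the platform, run the
    program and return the measured energy.
    """
    best_cfg, best_energy = None, float("inf")
    for cfg in product(sizes_kb, ways, line_bytes):
        measured = measure(*cfg)
        if measured < best_energy:
            best_cfg, best_energy = cfg, measured
    return best_cfg, best_energy


def fake_measure(size_kb, n_ways, line):
    """Stand-in for a real measurement; an arbitrary made-up cost."""
    return abs(size_kb - 4) + 0.1 * n_ways + abs(line - 32) / 32


print(tune(fake_measure))
```

Exhaustive search is workable here because the space is small (27 points in this grid); a larger space would call for a heuristic, which is the kind of desktop CAD and platform cooperation this slide lists as a goal.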


Frank Vahid, UC Riverside 35

Other Research Directions: Dynamic Hw/Sw Partitioning

My favorite
Add component on-chip:
  Detects most frequent sw loops
  Decompiles a loop
  Performs compiler optimizations
  Synthesizes to a netlist
  Places and routes the netlist onto FPGA
  Updates sw to call FPGA
Self-improving IC
  Can be invisible to designer
    Appears as efficient processor
  Can also dynamically tune the cache configuration

[Figure: chip block diagram with blocks labeled Processor, I$, D$, Mem, DMA, Profiler, Proc., Mem and Config. Logic]

Frank Vahid, UC Riverside 36

Current Researchers Working in Embedded Systems at UCR

Prof. Frank Vahid
  5 Ph.D. students, 2 M.S.
Prof. Walid Najjar
  3 Ph.D. students, 1 M.S., working on hw/sw partitioning, and on compiling C to FPGAs
Prof. Tom Payne
  1 Ph.D. student, working on compiling C to FPGAs
Prof. Jun Yang (new hire)
  Working on low power architectures (frequent value detection)
Prof. Harry Hsieh
  2 Ph.D. students, working on formal verification of system models
Prof. Sheldon Tan (new hire)
  1 Ph.D. student, working on physical design, and analog synthesis

Frank Vahid, UC Riverside 37

Conclusions

Highly configurable platforms have a bright future
  Cost equations just don't justify ASIC production as much as before
  Triscend parts are well situated; close collaboration desired

Configurable cache improves memory energy
  Tuning to a particular program is CRUCIAL to low energy
  Way concatenation is effective at reducing dynamic power
  Way shutdown saves static power
  Variable line size reduces traffic
  All must be tuned to a particular program

Configurable logic improves software energy
  Without requiring excessive amounts of hardware

Many exciting avenues to investigate!