Tuning of Loop Cache Architectures to Programs in Embedded System Design
Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship


Page 1: Title

Page 2: Introduction

There is an opportunity to tune the microprocessor architecture to the program.

[Figure: a traditional pre-fabricated microprocessor architecture vs. a core-based microprocessor architecture]

Page 3: Introduction

[Figure: system-on-chip containing a processor, I-cache (I$), D-cache (D$), JPEG codec, USB, bridge, CCDPP, and memory, with the tunable components highlighted]

Tunable parameters include:
• I-cache: size, associativity, replacement policy
• JPEG: compression
• Buses: width, bus-invert/gray code

Page 4: Introduction

• Memory access can consume 50% of an embedded microprocessor's system power
  – Caches tend to be power hungry
• M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
• ARM920T: caches consume half of total power (Segars 01)

[Figure: ARM920T power breakdown – ARM9 25%, I-Cache 25%, D-Cache 19%, BIU 8%, D MMU 5%, I MMU 4%, Clocks 4%, Other 4%, SysCtl 3%, CP15 2%, PATag RAM 1%]
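As a quick sanity check on the breakdown above, summing the transcribed chart slices (a sketch; percentages copied from the slide) shows the caches alone account for 44% of ARM920T power, and just over half once the MMUs are included:

```python
# ARM920T power breakdown as transcribed from the slide (Segars 01).
breakdown = {
    "ARM9": 25, "SysCtl": 3, "CP15": 2, "BIU": 8, "PATag RAM": 1,
    "Clocks": 4, "Other": 4, "D MMU": 5, "D Cache": 19,
    "I Cache": 25, "I MMU": 4,
}

cache_share = breakdown["I Cache"] + breakdown["D Cache"]          # caches alone
with_mmus = cache_share + breakdown["I MMU"] + breakdown["D MMU"]  # plus MMUs

print(sum(breakdown.values()))  # 100 -- the slices cover the whole chip
print(cache_share)              # 44 -- "half of total power" is caches alone
print(with_mmus)                # 53 -- including the cache MMUs
```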

Page 5: Introduction

It is therefore advantageous to focus on the instruction fetching subsystem.

[Figure: the same system-on-chip, with the processor, I-cache, and memory path highlighted]

Page 6: Introduction

• Techniques to reduce instruction fetch power
  – Program compression
    • Compress only a subset of frequently used instructions (Benini 1999)
    • Compress procedures in a small cache (Kirovski 1997)
    • Lookup-table based (Lekatsas 2000)
  – Bus encoding
    • Increment (Benini 1997)
    • Bus-invert (Stan 1995)
    • Binary/gray code (Mehta 1996)

Page 7: Introduction

• Techniques to reduce instruction fetch power (cont.)
  – Efficient cache design
    • Small buffers – victim, non-temporal, speculative, and penalty – to reduce miss rate (Bahar 1998)
    • Memory array partitioning and variation in cache sizes (Ko 1995)
  – Tiny caches
    • Filter cache (Kin/Gupta/Mangione-Smith 1997)
    • Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999)
    • Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)

Page 8: Cache Architectures – Filter Cache

• Small L0 direct-mapped cache
• Utilizes standard tag comparison and miss logic
• Has low dynamic power
  – Short internal bitlines
  – Close to the microprocessor
• Performance penalty of 21% due to high miss rate (Kin 1997)

[Figure: processor fetches through the filter cache (L0), which is backed by L1 memory]
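The filter cache's behavior can be sketched as an ordinary direct-mapped cache in front of L1. This is a minimal model, not the paper's implementation; the default size and line size here are hypothetical picks from the explored range (8–1024 bytes, 8–64 byte lines):

```python
# Minimal sketch of a direct-mapped L0 "filter" cache with standard
# tag comparison; every miss implies a stall to fill from L1, which is
# the source of the filter cache's performance penalty.
class FilterCache:
    def __init__(self, size_bytes=64, line_bytes=8):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines  # standard tag store
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index = line % self.num_lines   # direct-mapped: one candidate line
        tag = line // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1              # serve the fetch from the tiny L0
            return True
        self.tags[index] = tag          # miss: fill the line from L1
        self.misses += 1
        return False

# A tight 4-instruction loop (4-byte instructions) misses only on its
# first pass, then hits for the remaining iterations.
cache = FilterCache()
for _ in range(10):
    for pc in range(0, 16, 4):
        cache.access(pc)
print(cache.hits, cache.misses)  # 38 2
```

Loop-dominated embedded code behaves like this inner loop, which is why such a tiny structure can capture most fetches.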

Page 9: Cache Architectures – Dynamically Loaded Loop Cache

• Small tagless loop cache
• Alternative location to fetch instructions from
• Dynamically fills the loop cache
  – Filling is triggered by a short backwards branch (sbb) instruction
• Flexible variation
  – Allows loops larger than the loop cache to be partially stored

[Figure: processor fetches from either L1 memory or the dynamic loop cache through a mux. Example loop: ... add r1,2 ... sbb -5. Iteration 1: detect sbb instruction; iteration 2: fill loop cache; iteration 3: fetch from loop cache]
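The three-phase behavior in the figure can be sketched as a tiny state machine. This is a simplification of the real controller (assumed states and a two-instruction example loop), shown only to make the iteration-by-iteration sequence concrete:

```python
# Sketch of the dynamically loaded loop cache controller:
# iteration 1 detects the short backwards branch (sbb), iteration 2
# fills the tagless cache while instructions stream from L1, and from
# iteration 3 onward fetches come from the loop cache instead of L1.
def run(iterations=3, loop=("add r1,2", "sbb -5")):
    cache, state, log = [], "idle", []
    for it in range(1, iterations + 1):
        if state == "idle":
            state = "fill"                       # sbb taken: arm filling
            log.append((it, "detect sbb"))
        elif state == "fill":
            cache = list(loop)                   # copy loop body from L1
            state = "fetch"
            log.append((it, "fill loop cache"))
        else:
            log.append((it, "fetch from loop cache"))
    return log, cache

log, cache = run()
print(log)
```

Note the cost profile this implies: the first two iterations still fetch from L1, so very short-lived loops gain little.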

Page 10: Cache Architectures – Dynamically Loaded Loop Cache (cont.)

• Limitations
  – Does not support loops with control-of-flow changes (cofs)
  – cofs terminate loop cache filling and fetching
  – cofs include commonly found if-then-else statements

[Figure: example loop containing a cof: ... add r1,2; bne r1, r2, 3; ... sbb -5. Iteration 1: detect sbb instruction; iterations 2 and 3: fill loop cache, terminate at cof]

Page 11: Cache Architectures – Preloaded Loop Cache

• Small tagless loop cache
• Alternative location to fetch instructions from
• Loop cache filled at compile time and remains fixed
  – Supports loops with cofs
• Fetch triggered by a short backwards branch
• Start-address variation
  – Fetch begins on the first loop iteration

[Figure: processor fetches from either L1 memory or the preloaded loop cache through a mux. Example loop: ... add r1,2; bne r1, r2, 3; ... sbb -5. Iteration 1: detect sbb instruction; iteration 2: check whether the loop is preloaded and, if so, fetch from the cache]
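The run-time check above is simple because the contents are fixed: an sbb only redirects fetch if its loop was selected at compile time, as recorded in the loop address registers (LARs). A minimal sketch, with hypothetical addresses and a 2-LAR configuration assumed for illustration:

```python
# Sketch of the preloaded loop cache lookup. Loops chosen at compile
# time are loaded into the cache and their start addresses latched in
# loop address registers (LARs); there is no dynamic filling.
loop_address_registers = {0x100, 0x180}   # hypothetical 2-LAR setup

def on_sbb(loop_start_addr):
    # taken sbb: compare the loop's start address against the LARs
    if loop_start_addr in loop_address_registers:
        return "fetch from loop cache"
    return "fetch from L1"                # loop was not preloaded

print(on_sbb(0x100))  # a preloaded loop -> served by the loop cache
print(on_sbb(0x200))  # not preloaded -> falls back to L1
```

This is why the configuration table that follows varies the number of LARs: more registers allow more distinct loops to be preloaded.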

Page 12: Traditional Design

• Traditional pre-fabricated IC
  – Typically optimized for the best average case
  – Intended to run well across a variety of programs
  – A benchmark suite is used to determine which configuration to use
• On average, what is the best tiny cache configuration?

[Figure: processor fetching from L1 memory through a mux, with the tiny cache left as an unknown]

Page 13: Evaluation Framework – Candidate Cache Configurations

Type                                   | Size           | Number of loops / line size | Configuration #
Original dynamically loaded loop cache | 8–1024 entries | n/a                         | 1–8
Flexible dynamically loaded loop cache | 8–1024 entries | n/a                         | 9–16
Preloaded loop cache (sa)              | 8–1024 entries | 2–3 loop address registers  | 17–32
Preloaded loop cache (sbb)             | 8–1024 entries | 2–6 loop address registers  | 33–72
Filter cache                           | 8–1024 bytes   | line size of 8 to 64 bytes  | 73–106

Page 14: Evaluation Framework – Motorola's Powerstone Benchmarks

Benchmark | Lines of C | # Instructions Executed | Description
adpcm     | 501 | 63891   | Voice Encoding
bcnt      | 90  | 1938    | Bit Manipulation
binary    | 67  | 816     | Binary Insertion
blit      | 94  | 22845   | Graphics Application
compress  | 943 | 138573  | Data Compression Program
crc       | 84  | 37650   | Cyclic Redundancy Check
des       | 745 | 122214  | Data Encryption Standard
engine    | 276 | 410607  | Engine Controller
fir       | 173 | 16211   | FIR Filtering
g3fax     | 606 | 1128023 | Group Three Fax Decode
jpeg      | 540 | 4594721 | JPEG Compression
summin    | 74  | 1909787 | Handwriting Recognition
ucbqsort  | 209 | 219978  | U.C.B. Quick Sort
v42       | 553 | 2442551 | Modem Encoding/Decoding

Page 15: Simplified Tool Chain

[Figure: tool-chain flow – the program instruction trace feeds the loop selector (for preloaded caches) and lcsim, which produces loop cache stats; the lc power calculator combines those stats with technology info to produce loop cache power]

Page 16: Best on Average

[Figure: average savings (%) across the benchmark suite for loop cache configurations 1–106, grouped as original, flexible, preloaded (sa), preloaded (sbb), and filter; y-axis from -100% to 100%. Configurations 30 and 105 are highlighted]

• Configuration 30
  – Preloaded loop cache (sa), 512 entries, 3 loop address registers
  – 73% instruction fetch energy savings
• Configuration 105
  – Filter cache, 1024 entries, line size 32 bytes
  – 73% instruction fetch energy savings

Page 17: Core Based Design

• Core-based design
  – The application is known
  – Opportunity to tune the architecture
• Is it worth tuning the architecture to the application, or is the average case good enough?

[Figure: core-based microprocessor architecture]

Page 18: Best on Average

• Both configurations perform well for some benchmarks, such as engine and summin
• However, both configurations perform below average for binary, v42, and others

[Figure: percent savings per benchmark for config 30 vs. config 105; y-axis 0–100%]

Page 19: Results – binary

• Config 30 yields 61% savings
• Config 105 yields 65% savings
• Config 31 (preloaded / 1024 entries / 2 LARs) yields 79% savings

[Figure: % savings on binary for configurations 1–106 (original, flexible, preloaded (sa), preloaded (sbb), filter), with configurations 30, 31, and 105 highlighted]

Page 20: Results – v42

• Config 30 yields 58% savings
• Config 105 yields 23% savings
• Config 67 (preloaded / 512 entries / 6 LARs) yields 68% savings

[Figure: % savings on v42 for configurations 1–106 (original, flexible, preloaded (sa), preloaded (sbb), filter), with configurations 30, 67, and 105 highlighted]

Page 21: Results – averages

[Figure: percent savings per benchmark for the best per-program configuration vs. config 30 vs. config 105; y-axis 0–120%]

• Average case: best case 84%, config 30 73%, config 105 73% – improvement 11%
• adpcm: best case 68% (preloaded), config 105 25% – improvement 43%
• v42: best case 68% (preloaded), config 105 23% – improvement 45%
• blit: best case 96% (flexible), config 30 87% – improvement 9%
• jpeg: best case 92% (filter), config 30 69% – improvement 23%
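The "improvement" figures above are simply the gap between the best per-program configuration's savings and the fixed configuration's savings, as a quick check confirms (numbers transcribed from the slide):

```python
# improvement = (best per-program savings) - (fixed-configuration savings),
# both in percent; the fixed configuration is whichever of config 30 or
# config 105 the slide compares against for that benchmark.
data = {  # benchmark: (best-case savings %, fixed-config savings %)
    "average": (84, 73),   # vs. config 30 or config 105
    "adpcm":   (68, 25),   # vs. config 105
    "v42":     (68, 23),   # vs. config 105
    "blit":    (96, 87),   # vs. config 30
    "jpeg":    (92, 69),   # vs. config 30
}
improvement = {b: best - fixed for b, (best, fixed) in data.items()}
print(improvement)  # {'average': 11, 'adpcm': 43, 'v42': 45, 'blit': 9, 'jpeg': 23}
```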

Page 22: Conclusion and Future Work

• Shown the benefits of tuning the tiny cache to a particular program
  – On average, tuning yields an additional 11% energy savings
  – Up to an additional 40% for some programs
• The environment is automated but requires several hours to find the best configuration
  – The current simulation-based methodology is too slow
  – A faster method based on equations is described in an upcoming ICCAD 2002 paper