Tuning of Loop Cache Architectures to Programs in Embedded System Design

Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship.
Slide 2 – Introduction
• Opportunity to tune the microprocessor architecture to the program
[Figure: traditional pre-fabricated microprocessor vs. core-based microprocessor architecture]
Slide 3 – Introduction
[Figure: system-on-chip containing Processor, I$, D$, JPEG, CCDPP, USB, Bridge, and Mem blocks, each offering tunable parameters]
• I-cache
  – Size
  – Associativity
  – Replacement policy
• JPEG
  – Compression
• Buses
  – Width
  – Bus invert/gray code
Slide 4 – Introduction
• Memory access can consume 50% of an embedded microprocessor's system power
  – Caches tend to be power hungry
• M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
• ARM920T: caches consume half of total power (Segars 01)
[Figure: ARM920T power breakdown — ARM9 25%, I Cache 25%, D Cache 19%, BIU 8%, D MMU 5%, Clocks 4%, Other 4%, I MMU 4%, SysCtl 3%, CP15 2%, PATag RAM 1%]
Slide 5 – Introduction
• Advantageous to focus on the instruction fetching subsystem
[Figure: the same system-on-chip diagram with the instruction cache (I$) highlighted]
Slide 6 – Introduction
• Techniques to reduce instruction fetch power
  – Program compression
    • Compress only a subset of frequently used instructions (Benini 1999)
    • Compress procedures in a small cache (Kirovski 1997)
    • Lookup table based (Lekatsas 2000)
  – Bus encoding
    • Increment (Benini 1997)
    • Bus-invert (Stan 1995)
    • Binary/gray code (Mehta 1996)
Slide 7 – Introduction
• Techniques to reduce instruction fetch power (cont.)
  – Efficient cache design
    • Small buffers — victim, non-temporal, speculative, and penalty — to reduce miss rate (Bahar 1998)
    • Memory array partitioning and variation in cache sizes (Ko 1995)
  – Tiny caches
    • Filter cache (Kin/Gupta/Mangione-Smith 1997)
    • Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999)
    • Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)
Slide 8 – Cache Architectures – Filter Cache
• Small L0 direct-mapped cache
• Utilizes standard tag comparison and miss logic
• Has low dynamic power
  – Short internal bitlines
  – Close to the microprocessor
• Performance penalty of 21% due to high miss rate (Kin 1997)
[Figure: Processor ↔ Filter cache (L0) ↔ L1 memory]
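The direct-mapped lookup above can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the default sizes, the address-to-index arithmetic, and the `l1_read` callback are all assumptions.

```python
# Sketch of a direct-mapped L0 "filter" cache with standard tag comparison
# and miss logic. Sizes and field widths are illustrative placeholders.

class FilterCache:
    def __init__(self, num_lines=8, line_words=4):
        self.num_lines = num_lines
        self.line_words = line_words
        self.tags = [None] * num_lines                      # one tag per line
        self.data = [[0] * line_words for _ in range(num_lines)]
        self.hits = 0
        self.misses = 0

    def fetch(self, addr, l1_read):
        """Return the word at addr, refilling the line from L1 on a miss."""
        word = addr % self.line_words
        index = (addr // self.line_words) % self.num_lines
        tag = addr // (self.line_words * self.num_lines)
        if self.tags[index] == tag:                         # tag comparison
            self.hits += 1
        else:                                               # miss: refill line
            self.misses += 1
            base = addr - word
            self.data[index] = [l1_read(base + i) for i in range(self.line_words)]
            self.tags[index] = tag
        return self.data[index][word]
```

A tight loop that fits in one line misses once and then hits on every subsequent fetch, which is exactly why such a tiny cache saves fetch energy despite its small size.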
Slide 9 – Cache Architectures – Dynamically Loaded Loop Cache
• Small tagless loop cache
• Alternative location to fetch instructions
• Dynamically fills the loop cache
  – Triggered by a short backwards branch (sbb) instruction
• Flexible variation
  – Allows loops larger than the loop cache to be partially stored
[Figure: Processor, dynamic loop cache, L1 memory, and mux running a loop ending in "add r1,2 ... sbb -5" — iteration 1: detect sbb instruction; iteration 2: fill loop cache; iteration 3: fetch from loop cache]
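The three-iteration protocol in the figure can be sketched as a small state machine. The state names, dict-based storage, and `step()` interface are illustrative assumptions; cof handling, which would abort filling (next slide), is not modeled.

```python
# Sketch of the dynamically loaded loop cache protocol: detect the sbb on
# iteration 1, fill while still fetching from L1 on iteration 2, then fetch
# from the loop cache on iteration 3 onward.

IDLE, FILL, ACTIVE = range(3)

class DynamicLoopCache:
    def __init__(self, size=32):
        self.size = size          # capacity in instructions
        self.lines = {}           # pc -> instruction captured during FILL
        self.state = IDLE

    def step(self, pc, instr, sbb_taken=False):
        """Process one fetch; return 'cache' or 'L1' as the fetch source."""
        source = "cache" if self.state == ACTIVE and pc in self.lines else "L1"
        if self.state == FILL and len(self.lines) < self.size:
            self.lines[pc] = instr              # capture the loop body
        if sbb_taken:                           # taken short backwards branch
            if self.state == IDLE:
                self.state = FILL               # iteration 1: sbb detected
            elif self.state == FILL:
                self.state = ACTIVE             # iteration 2: body captured
        return source
```

Because the cache is tagless, correctness rests on the controller only steering fetches to the loop cache while execution stays inside the detected loop.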
Slide 10 – Cache Architectures – Dynamically Loaded Loop Cache (cont.)
• Limitations
  – Does not support loops with control of flow changes (cofs)
  – cofs terminate loop cache filling and fetching
  – cofs include commonly found if-then-else statements
[Figure: the loop now contains "bne r1, r2, 3" before "sbb -5" — iteration 1: detect sbb instruction; iterations 2 and 3: fill loop cache, terminate at cof]
Slide 11 – Cache Architectures – Preloaded Loop Cache
• Small tagless loop cache
• Alternative location to fetch instructions
• Loop cache filled at compile time and remains fixed
  – Supports loops with cofs
• Fetch triggered by short backwards branch
• Start address variation
  – Fetch begins on the first loop iteration
[Figure: Processor, preloaded loop cache, L1 memory, and mux — iteration 1: detect sbb instruction; iteration 2: check whether the loop is preloaded and, if so, fetch from the cache]
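A sketch of the preloaded variant follows. The loop-address-register (LAR) check on a taken sbb comes from the slides; the data layout and method names are illustrative assumptions, and the start-address (sa) variation that begins fetching on the very first iteration is not modeled here.

```python
# Sketch of a preloaded tagless loop cache: contents are selected at compile
# time and stay fixed, and loop address registers (LARs) record which loops
# are resident. A taken sbb whose target matches a LAR activates the cache.

class PreloadedLoopCache:
    def __init__(self, loops):
        """loops: dict mapping (start, end) address pairs to instruction lists."""
        self.lars = list(loops)     # loop address registers (fixed at compile time)
        self.store = loops          # preloaded cache contents (never refilled)
        self.active = None          # currently matched loop range, if any

    def on_sbb(self, target):
        """On a taken short backwards branch, check the LARs for the target."""
        self.active = next(((s, e) for s, e in self.lars if s == target), None)

    def fetch(self, pc):
        """Return (source, instr); fall back to L1 when the loop isn't preloaded."""
        if self.active and self.active[0] <= pc <= self.active[1]:
            start, _ = self.active
            return "cache", self.store[self.active][pc - start]
        return "L1", None
```

Since the contents never change at run time, loops containing cofs pose no problem: the body, branches included, is simply stored verbatim.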
Slide 12 – Traditional Design
• Traditional pre-fabricated IC
  – Typically optimized for the best average case
  – Intended to run well across a variety of programs
  – A benchmark suite is used to determine which configuration to use
• On average, what is the best tiny cache configuration?
[Figure: Processor, unknown tiny cache ("?"), L1 memory, and mux]
Slide 13 – Evaluation Framework – Candidate Cache Configurations

Type                                   | Size           | Number of loops / line size  | Configuration
Original dynamically loaded loop cache | 8-1024 entries | n/a                          | 1-8
Flexible dynamically loaded loop cache | 8-1024 entries | n/a                          | 9-16
Preloaded loop cache (sa)              | 8-1024 entries | 2-3 loop address registers   | 17-32
Preloaded loop cache (sbb)             | 8-1024 entries | 2-6 loop address registers   | 33-72
Filter cache                           | 8-1024 bytes   | line size of 8 to 64 bytes   | 73-106
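The loop cache portion of this candidate space can be enumerated directly from the table (assuming, as the configuration counts imply, power-of-two sizes from 8 to 1024 entries). The filter cache family's exact size/line-size pairing for configurations 73-106 is not fully specified here, so it is left out of the count.

```python
# Enumerate the loop cache families from the candidate-configuration table.
# Power-of-two sizes (8..1024 entries) are an assumption consistent with the
# per-family configuration counts (8, 8, 16, and 40).

sizes = [2 ** k for k in range(3, 11)]  # 8, 16, ..., 1024 entries

configs = []
configs += [("original", s) for s in sizes]                                     # 1-8
configs += [("flexible", s) for s in sizes]                                     # 9-16
configs += [("preloaded-sa", s, lars) for s in sizes for lars in (2, 3)]        # 17-32
configs += [("preloaded-sbb", s, lars) for s in sizes for lars in range(2, 7)]  # 33-72
```

Together with the 34 filter cache configurations this yields the 106-entry design space explored on the following slides.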
Slide 14 – Evaluation Framework – Motorola's Powerstone Benchmarks

Benchmark | Lines of C | # Instructions Executed | Description
adpcm     | 501        | 63891                   | Voice Encoding
bcnt      | 90         | 1938                    | Bit Manipulation
binary    | 67         | 816                     | Binary Insertion
blit      | 94         | 22845                   | Graphics Application
compress  | 943        | 138573                  | Data Compression Program
crc       | 84         | 37650                   | Cyclic Redundancy Check
des       | 745        | 122214                  | Data Encryption Standard
engine    | 276        | 410607                  | Engine Controller
fir       | 173        | 16211                   | FIR Filtering
g3fax     | 606        | 1128023                 | Group Three Fax Decode
jpeg      | 540        | 4594721                 | JPEG Compression
summin    | 74         | 1909787                 | Handwriting Recognition
ucbqsort  | 209        | 219978                  | U.C.B. Quick Sort
v42       | 553        | 2442551                 | Modem Encoding/Decoding
Slide 15 – Simplified Tool Chain
[Figure: a program instruction trace feeds the loop selector (preloaded caches) and lcsim; lcsim produces loop cache stats, which the lc power calculator combines with technology info to produce loop cache power]
Slide 16 – Best on Average
[Figure: average instruction fetch energy savings (%) across loop cache configurations 1-106 — original, flexible, preloaded (sa), preloaded (sbb), and filter families — with configurations 30 and 105 marked]
• Configuration 30
  – Preloaded loop cache (sa), 512 entries, 3 loop address registers
  – 73% instruction fetch energy savings
• Configuration 105
  – Filter cache, 1024 entries, 32-byte line size
  – 73% instruction fetch energy savings
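The 73% figures are instruction fetch energy savings relative to fetching every instruction from L1. A toy version of that arithmetic, with made-up per-access energies and ignoring fill and controller overheads, looks like this:

```python
# Illustrative fetch-energy-savings arithmetic. The per-access energies and
# the cache-served fraction below are placeholders, not measured values.

def fetch_energy_savings(f_cache, e_tiny, e_l1):
    """Percent of baseline fetch energy saved when a fraction f_cache of
    fetches is served by a tiny cache at e_tiny per access instead of e_l1."""
    baseline = e_l1
    tuned = f_cache * e_tiny + (1 - f_cache) * e_l1
    return 100.0 * (baseline - tuned) / baseline
```

For example, serving 80% of fetches from a tiny cache that costs a tenth of an L1 access saves 72% of fetch energy, which is why hit (or loop-residency) rate dominates these results.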
Slide 17 – Core-Based Design
• Core-based design
  – The application is known
  – Opportunity to tune the microprocessor architecture
• Is it worth tuning the architecture to the application, or is the average case good enough?
Slide 18 – Best on Average
• Both configurations perform well for some benchmarks, such as engine and summin
• However, both configurations perform below average for binary, v42, and others
[Figure: percent savings per benchmark for configurations 30 and 105]
Slide 19 – Results – binary
• Config 30 yields 61% savings
• Config 105 yields 65% savings
• Config 31 (preloaded, 1024 entries, 2 loop address registers) yields 79% savings
[Figure: % savings across loop cache configurations 1-106 for binary, with configurations 30, 31, and 105 marked]
Slide 20 – Results – v42
• Config 30 yields 58% savings
• Config 105 yields 23% savings
• Config 67 (preloaded, 512 entries, 6 loop address registers) yields 68% savings
[Figure: % savings across loop cache configurations 1-106 for v42, with configurations 30, 67, and 105 marked]
Slide 21 – Results – Averages
[Figure: percent savings per benchmark for the best configuration, config 30, and config 105]
• Average case — best case: 84%; config 30: 73%; config 105: 73%; improvement: 11%
• adpcm — best case: 68% (preloaded); config 105: 25%; improvement: 43%
• v42 — best case: 68% (preloaded); config 105: 23%; improvement: 45%
• blit — best case: 96% (flexible); config 30: 87%; improvement: 9%
• jpeg — best case: 92% (filter); config 30: 69%; improvement: 23%
Slide 22 – Conclusion and Future Work
• Showed the benefits of tuning the tiny cache to a particular program
  – On average yields an additional 11% energy savings
  – Up to an additional 40% for some programs
• The environment is automated but requires several hours to find the best configuration
  – The current methodology is too slow
  – A faster, equation-based method is described in an upcoming ICCAD 2002 paper