A One-Shot Configurable-Cache Tuner for Improved Energy and Performance
Ann Gordon-Ross1, Pablo Viana2, Frank Vahid1, Walid Najjar1, and Edna Barros3
1 Dept of Computer Science & Engineering - University of California, Riverside, USA
2 Campus Arapiraca - Federal University of Alagoas, Brazil
3 Centro de Informática - Federal University of Pernambuco, Brazil
This work was supported by the U.S. National Science Foundation, and by the Semiconductor Research Corporation
Ann Gordon-Ross, Univ of Ca, Riverside
Introduction
• Memory access: 50% of embedded processor’s system power
• Caches are power hungry
• ARM920T (Segars 01)
• M*CORE (Lee/Moyer/Arends 99)
• Thus, caches are a good candidate for optimizations
[Figure: system power breakdown across Processor, L1 I Cache, L1 D Cache, and Main Mem; labeled value: 53%]
Introduction
• Different applications have vastly different cache requirements
• Total size, line size, and associativity
• Cache parameters that don’t match an application’s behavior can waste over 60% of energy (Gordon-Ross 05)
• Cache tuning is the process of determining the appropriate cache parameters for an application
[Figure: example cache configurations: 4KB, 16 byte, 2-way; 2KB, 32 byte, direct-mapped; 8KB, 64 byte, 4-way]
Runtime Cache Tuning
• Best cache configuration can be determined by searching the design space during runtime
• Runtime cache tuning is transparent to the designer and end user, but incurs runtime overhead in terms of energy and performance
[Figure: energy timeline. Labels: Download application; Executing in base configuration; Cache Tuning; Tunable cache (TC); Tuning hw]
Contribution
• We introduce specialized hardware for non-intrusive runtime cache evaluation
• Temporary energy overhead and no performance overhead
• Single-pass multi-cache evaluation - SPCE
• Special hardware simultaneously evaluates all cache configurations
• Enables switching to the best configuration in one-shot
[Figure: energy timeline. Labels: Download application; Executing in base configuration; SPCE causes an increase in energy but no performance overhead; Switch to best config in “one-shot”; Tunable cache (TC); SPCE]
SPCE Key Points
• Contributions compared to previous methods
• Evaluates a highly configurable cache
–Previous methods offer little configurability
• Little hardware overhead
–Simple data structures
–Elementary operations
SPCE
• Monitors address stream to extract cache hit information for all configurations
Fully-associative cache example (64-bit architecture)
Address stream: t0 = 0, t1 = 8, t2 = 16, t3 = 0, t4 = 8, t5 = 0, t6 = 16
Table (stored hit info): indexed by line size (2^b words, b = 0, 1, 2) and number of lines (d = 1 to 8), giving 24 different configs
Number of conflicts determines cache sizes that would result in a hit
For each line size …
• >> 2^0*8: t0 = 0, t1 = 1, t2 = 2, t3 = 0, t4 = 1, t5 = 0, t6 = 2
–HIT at t3, t4, and t6 with d = 3 (3 hits); HIT at t5 with d = 2 (1 hit)
• >> 2^1*8: t0 = 0, t1 = 0, t2 = 1, t3 = 0, t4 = 0, t5 = 0, t6 = 1
–HIT at t1, t4, and t5 with d = 1 (3 hits); HIT at t3 and t6 with d = 2 (2 hits)
• >> 2^2*8: t0 = 0, t1 = 0, t2 = 0, t3 = 0, t4 = 0, t5 = 0, t6 = 0
–HIT at t1 through t6 with d = 1 (6 hits)
Cache with 2 lines with 2^1 words per line (32 bytes) will have 5 hits and 7 - 5 = 2 misses
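The fully-associative example above can be reproduced with a short software sketch. This is not the SPCE hardware: it is a minimal Python model that assumes LRU replacement and a most-recently-used-first stack, and the names `stack_distance_hits` and `hits_for` are hypothetical. For each line size it shifts the byte address down to a block address, counts how many unique blocks were touched since that block's last use (the conflicts), and records a hit at d = conflicts + 1 lines.

```python
from collections import Counter

WORD_BYTES = 8  # 64-bit architecture: 8 bytes per word


def stack_distance_hits(addresses, line_size_exps=(0, 1, 2), max_lines=8):
    """For each line size of 2**b words, count hits by the smallest number
    of lines d at which a fully-associative LRU cache would hit."""
    table = {}
    for b in line_size_exps:
        shift = b + (WORD_BYTES.bit_length() - 1)  # log2(2**b * 8)
        stack = []          # block addresses, most recently used first
        counts = Counter()  # d -> number of hits needing exactly d lines
        for addr in addresses:
            block = addr >> shift
            if block in stack:
                conflicts = stack.index(block)  # unique blocks since last use
                d = conflicts + 1
                if d <= max_lines:
                    counts[d] += 1
                stack.remove(block)
            stack.insert(0, block)
        table[b] = counts
    return table


def hits_for(table, b, num_lines):
    """Total hits for a cache with num_lines lines of 2**b words each."""
    return sum(n for d, n in table[b].items() if d <= num_lines)


stream = [0, 8, 16, 0, 8, 0, 16]
table = stack_distance_hits(stream)
hits = hits_for(table, b=1, num_lines=2)
print(hits, len(stream) - hits)  # 5 hits, 2 misses
```

A cache with more lines also captures every hit recorded for fewer lines, which is why `hits_for` sums all table entries with d at or below the chosen number of lines.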
SPCE
• SPCE determines hits for other set-associativities by counting the number of unique conflicts in the address trace
[Figure: tables in multiple layers, one each for direct-mapped, 2-way, and 4-way. Each table stores hit info indexed by line size (number of words, b = 0, 1, 2) and number of sets (s = 1 to 8)]
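The same conflict-counting idea extends to set-associative caches if conflicts are counted only among blocks that map to the same set. The sketch below is an illustrative software model, not the SPCE hardware: it assumes LRU replacement within each set and modulo set indexing, and the function name `set_assoc_hits` and its parameters are hypothetical. An access hits in a W-way cache when fewer than W unique conflicting blocks touched its set since the block's last use.

```python
def set_assoc_hits(addresses, b, num_sets, ways_options=(1, 2, 4), word_bytes=8):
    """Count hits per associativity for lines of 2**b words and num_sets sets.

    Assumes LRU replacement within each set (an assumption of this sketch).
    """
    shift = b + (word_bytes.bit_length() - 1)  # log2(2**b * word_bytes)
    stacks = {}  # set index -> blocks in that set, most recently used first
    hits = {w: 0 for w in ways_options}
    for addr in addresses:
        block = addr >> shift
        stack = stacks.setdefault(block % num_sets, [])
        if block in stack:
            conflicts = stack.index(block)  # unique same-set blocks since last use
            for w in ways_options:
                if conflicts < w:  # block still resident in a w-way set
                    hits[w] += 1
            stack.remove(block)
        stack.insert(0, block)
    return hits


stream = [0, 8, 16, 0, 8, 0, 16]
print(set_assoc_hits(stream, b=0, num_sets=2))  # {1: 2, 2: 4, 4: 4}
```

With one set the model reduces to the fully-associative case, where the number of ways plays the role of the number of lines.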
SPCE - Hardware
• Designed and evaluated in synthesizable VHDL
[Figure: SPCE hardware diagram, including a stack structure]
Results - Energy Savings
• Energy savings compared to exploring the design space using a state-of-the-art intrusive heuristic (Zhang 03)
• Values less than 1 denote an energy increase
[Figure: bar chart of energy savings per benchmark (y-axis 0 to 12) for brev, epic, rawcaudio, pocsag, g721Decode, pegwitDecode, v42, ucbqsort, g3fax, bilv, binary, blit, matmul, mpegDecode, pegwitEncode, jpegDecode, ps-jpeg, fir, jpegEncode, bcnt, and average. Three bars carry the value labels 0.99, 0.77, and 0.98.]
• Average: 4.6x less energy expended
Results - Tuning Speedup
• Tuning speedup obtained compared to a state-of-the-art intrusive heuristic
[Figure: bar chart of tuning speedup per benchmark (y-axis 0 to 16) for brev, epic, rawcaudio, pocsag, g721Decode, pegwitDecode, v42, ucbqsort, g3fax, bilv, binary, blit, matmul, mpegDecode, pegwitEncode, jpegDecode, ps-jpeg, fir, jpegEncode, bcnt, and average.]
• Average: 7.7x faster
Overheads
• Evaluated SPCE compared to the ARM920T
• Area
• 12% area overhead
–Due in large part to the TCAM stack structure
• Power
• Temporary 2.2x increase in power during the short tuning cycle
–The application need only iterate 4 times for the average power overhead to drop to 1%
Conclusions
• SPCE is a specialized hardware structure to evaluate all cache configurations simultaneously
• Enables non-intrusive runtime cache evaluation
• Enables switching to best cache configuration in one shot
• Compared to a state-of-the-art intrusive cache tuning heuristic
• 4.6x less energy expended
• 7.7x speedup in tuning time
• 12% area overhead compared to ARM920T
• Temporary 2.2x increase in power during short tuning time
–Only 4 application iterations to recoup power