Hot Caches, Cool Techniques: Online Tuning of Highly Configurable Caches for Reduced Energy Consumption
Ann Gordon-Ross, Department of Computer Science and Engineering, University of California, Riverside
Frank Vahid – PhD Advisor
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.
Ann Gordon-Ross, UC Riverside 2 of 45
Introduction
- Much research is devoted to reducing power consumption in mobile embedded devices
  - Increased battery life
  - Decreased cooling requirements
Introduction
- The cache hierarchy consumes a large fraction of system power
- We can use configurable caches to reduce power consumption
- However, configuring (tuning) the cache is very difficult: many parameters lead to a very large design space
- This talk describes research that addresses the problem of quickly tuning highly configurable caches:
  - Efficient heuristics for increasingly complex configurable cache hierarchies
  - A feedback-control system for online cache tuning
Cache Power Consumption
- Memory access: 50% of an embedded processor's system power
- Caches are power hungry: ARM920T (Segars '01), M*CORE (Lee/Moyer/Arends '99)
- Thus, caches are a good candidate for optimization
[Figure: memory hierarchy with Processor, L1 Cache, L2 Cache, and Main Mem; caches account for 53% of power]
Reducing Cache Energy Consumption
- Research shows that different applications have different cache requirements (Zhang '04)
- Depending on its working set, an application may require different values for the cache parameters:
  - Total size
  - Line size (block size)
  - Associativity
- Cache parameters that don't match an application's behavior can waste over 40% of energy (Balasubramonian '00, Zhang '03)
Excess Cache Energy Consumption
- Total size
  - Too large (larger than the working set): excess fetch and static energy
  - Too small: excess thrashing energy; stall cycles going to the next level of memory are excess energy
- Line size
  - Too large (more than is used): excess fetch energy spent on unused data
  - Too small: excess stall energy; stall cycles fetching from the next level of memory are excess energy
Excess Cache Energy Consumption
- Associativity
  - Too high: excess fetch energy per access, spent checking unused ways
  - Too low: excess miss energy and decreased performance
- Configurable caches allow cache parameter values to be varied, or tuned, specializing the cache to the needs of an application
Configurable Caches
- Soft cores: designer-specified cache parameters (ARM, MIPS, Tensilica)
[Figure: processor HDL with a specialized cache goes to the fab, producing a chip with the specialized cache]
Configurable Caches
- Even hard processors contain configurable caches
  - Specialized software instructions can change cache parameters
  - Specialized hardware enables the cache to be configured at startup or in-system during runtime
  - Motorola M*CORE (Malik ISLPED '00), Albonesi MICRO '00, Zhang ISCA '03
[Figure: an 8 KB, 4-way base cache built from four 2 KB banks. Way concatenation yields an 8 KB 2-way or an 8 KB direct-mapped cache; way shutdown yields a 4 KB 2-way or a 2 KB direct-mapped cache; line size is also configurable over a 16-byte physical line size. Tuning hardware drives the tunable cache.]
Cache Tuning
- However, configurable caches are relatively new: designers are provided with configurable caches but are not told how to determine the best cache configuration
- Cache tuning is the process of determining the appropriate cache parameter values for an application
- Cache tuning is very difficult: 100s to 10,000s of different configurations
Cache Tuning Difficulties
- Simulation method: simulate the system (microprocessor, L1 cache, L2 cache, main memory), tune each cache, and choose the lowest-energy configuration
  - Realistic input stimulus is difficult to model
  - A few seconds of real execution may take days or weeks to simulate
- Prediction method: examine the code and choose a configuration directly
[Figure: energy plotted over the possible cache configurations, with the chosen configuration marked]
Cache Tuning Difficulties
- Runtime tuning: download the application, then tune at system startup using tuning hardware that drives the tunable cache
  - Runtime tuning allows adaptation to new software and new operating environments
  - However, exhaustive exploration can unnecessarily extend this high-energy tuning time
[Figure: energy over time, showing a high-energy cache tuning period after system startup]
Cache Tuning Difficulties
- Design space: 100s to 10,000s of configurations
- Simulation-based approach: an exhaustive method finds the lowest-energy configuration, but a heuristic method explores far fewer points
- Runtime-based approach: at system startup, an exhaustive method spends far more tuning energy than a heuristic method
- Existing heuristics do not address the complexities of tuning a highly configurable cache consisting of 10,000s of different configurations
[Figure: energy over possible cache configurations (simulation-based) and over time after system startup (runtime-based), comparing exhaustive and heuristic methods]
Outline
- Develop an efficient tuning heuristic for a highly configurable two-level cache hierarchy
  - Developed in a simulation-based environment, but applicable to a dynamic tuning environment
  - 62% energy savings on average
- Current research: a feedback-control system for online cache tuning
Challenge for Two-Level Cache Tuning Heuristic Development
- Current methods handle a single-level configuration: one L1 cache with tunable size, line size, and associativity, giving 10s of configurations
- A two-level configuration multiplies the space: every L1 configuration can be paired with every L2 configuration
- Our two-level cache tuning goal, at roughly 30 configurations per cache (size, line size, associativity):
  - Two-level configuration with separate L2 caches (I and D hierarchies, each with its own L1 and L2): 30*30 + 30*30 = 1,800 configurations
  - Two-level configuration with a unified second level of cache: 30*30*30 = 27,000 configurations
Single-Level Tuning Heuristic
- Zhang's configurable cache: microprocessor with independently tuned L1 I$ and D$ and a tuner; 18 configurations per cache
- Our extended configurable cache: L1 and L2 I$ and D$ with a tuning dependency between the levels; 216 configurations per cache hierarchy
- Impact-ordered heuristics have been shown effective in previous tuning efforts (Zhang '03)
  - Tune parameters in order of energy impact, highest impact first (i.e., vary each parameter while holding the others fixed and measure the change)
  - Impact order for a cache: 1. total size, 2. line size, 3. associativity
  - Search parameter values from smallest to largest, to minimize flushing in a dynamic environment
  - Tune the instruction cache, then tune the data cache
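As a sketch, the impact-ordered search over a single cache might look like the following Python, where `simulate_energy` is a hypothetical callback returning the energy of a candidate configuration and the parameter value ranges are illustrative:

```python
def tune_cache(simulate_energy,
               sizes=(2048, 4096, 8192),     # total size in bytes, smallest first
               line_sizes=(16, 32, 64),      # line size in bytes
               assocs=(1, 2, 4)):            # associativity (ways)
    """Impact-ordered heuristic: tune total size first, then line size,
    then associativity, holding each chosen value fixed afterwards."""
    best = {"size": sizes[0], "line": line_sizes[0], "assoc": assocs[0]}
    for param, values in (("size", sizes),
                          ("line", line_sizes),
                          ("assoc", assocs)):
        best_e = float("inf")
        for v in values:                      # search smallest to largest
            e = simulate_energy({**best, param: v})
            if e < best_e:
                best_e, best[param] = e, v
            else:
                break                         # energy rose: keep the smaller value
    return best
```

Each parameter sweep stops as soon as energy rises, matching the smallest-to-largest search that minimizes flushing in a dynamic environment.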
First Heuristic – Tune Levels One-at-a-Time
- Tune each cache using the impact-ordered heuristic for one-level cache tuning
- Tune L1 first, then L2
  - Initial L2: 64 KB, 4-way, 64-byte line size
  - For the best L1 configuration, tune the L2 cache
[Figure: microprocessor, L1 cache, L2 cache, main memory]
Results of First Heuristic
- Base cache configuration: Level 1 – 8 KB, 4-way, 32-byte line size; Level 2 – 64 KB, 4-way, 64-byte line size
- The first heuristic achieved 32% energy savings versus 53% for optimal, and was worse than the base cache in some cases
[Figure: energy consumption normalized to the base cache configuration for g721, rawcaudio, pegwit, AIFFTR01, AIFIRF01, BITMNP01, IDCTRN01, PNTRCH01, TTSPRK01, and the average; first heuristic vs. optimal]
Interlacing Heuristic
- The first heuristic did not find the optimal in most cases, sometimes 200% or 300% worse
- Conclusion: the two levels should not be explored separately
  - Too much interdependence between L1 and L2 cache parameters, which Zhang's method does not address
  - L2 cache performance depends on how much, and what, misses in the L1 cache
- To more fully explore the dependencies between the two levels, we interlaced the exploration of the level one and level two caches:
  1. Tune L1 size
  2. Tune L2 size
  3. Tune L1 line size
  4. Tune L2 line size
  5. Tune L1 associativity
  6. Tune L2 associativity
- Do the same for the data cache hierarchy
- Interlacing performed better than the initial heuristic, but there was still much room for improvement
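A sketch of this interlaced ordering in Python; `space` maps each (level, parameter) pair to candidate values and `simulate_energy` is a hypothetical callback scoring a full configuration (both names are illustrative):

```python
def tune_interlaced(simulate_energy, space):
    """Interlaced two-level tuning sketch: for each parameter, in impact
    order, tune L1 and then L2 before moving on to the next parameter."""
    cfg = {lvl: {p: space[(lvl, p)][0] for p in ("size", "line", "assoc")}
           for lvl in ("L1", "L2")}
    for param in ("size", "line", "assoc"):   # impact order
        for lvl in ("L1", "L2"):              # interlace the two levels
            best_e = float("inf")
            for v in space[(lvl, param)]:     # smallest to largest
                trial = {l: dict(ps) for l, ps in cfg.items()}
                trial[lvl][param] = v
                e = simulate_energy(trial)
                if e < best_e:
                    best_e, cfg[lvl][param] = e, v
                else:
                    break                     # energy rose: stop this sweep
    return cfg
```

The same loop would be run once for the instruction hierarchy and once for the data hierarchy.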
Final Heuristic: Interlaced with Local Search
- Some cases were still sub-optimal; we examined them manually
- Cause: a limitation of the configurable cache architecture, in which certain associativities were not possible for some sizes
- Determined that a small local search was needed to overcome the limitation
- Final heuristic: the Two-Level Cache Tuner (TCaT)
TCaT Results
- 53% energy savings, near optimal
[Figure: energy consumption normalized to the base cache configuration for g721, rawcaudio, pegwit, AIFFTR01, AIFIRF01, BITMNP01, IDCTRN01, PNTRCH01, TTSPRK01, and the average; first heuristic vs. TCaT vs. optimal]
Extending the TCaT – Exploring a Unified Second Level of Cache
- Unified second-level caches are standard in desktop computers and are becoming increasingly popular in embedded microprocessors
- Current cache tuning heuristics do not directly apply due to the added circular dependency: a change in any cache affects the performance of all other caches in the hierarchy
[Figure: microprocessor with L1 I$ and D$, a unified U$ second level, main memory, and a tuner]
Level Two Cache Configurability
- For maximum configurability, the level two cache uses Motorola M*CORE-style way management
- Each way can be configured as an I-way, a D-way, or a U-way, versus a traditional 4-way unified level two cache
- In addition, the L2 cache offers the same line size configurability as the L1 caches
- The design space explodes to 18,000 configurations
[Figure: a traditional 4-way unified L2 beside an M*CORE way-managed cache, with sample L2 configurations mixing I-ways, D-ways, and U-ways]
Alternating Cache Exploration with Additive Way Tuning (ACE-AWT)
1. Tune level one sizes (I and D)
2. Tune level two size
3. Tune level one line sizes (I and D)
4. Tune level two line size
5. Tune level one associativities (I and D)
6. Tune level two associativity
- The level two steps are difficult because changing size and changing associativity are synonymous in a way-management-style cache
Way Management
- Increasing L2 size: starting from an 8 KB 1-way cache, adding a way (an I-way, D-way, or U-way) produces a 16 KB 2-way cache; adding another produces a 24 KB 3-way cache
- Decreasing L2 associativity: removing a way from the 24 KB 3-way cache produces a 16 KB 2-way cache
- Thus size and associativity cannot be changed independently
[Figure: sample way additions and removals over I-, D-, and U-ways, from 8 KB 1-way through 24 KB 3-way configurations]
ACE-AWT First Phase – L2 Size Exploration
- Start with an empty L2 cache
- Add one of each way type (I-way, D-way, U-way) to the current L2 configuration, resulting in 3 candidate configurations
- Simulate each candidate and select the minimum-energy configuration
- If energy decreased, repeat with that configuration as the new current one; if energy increased, or the cache is at maximum size, stop: the minimum-energy configuration becomes the selected L2 configuration
ACE-AWT Fine-Tuning Phase – Associativity Exploration
- Start with the current cache configuration
- Size and availability permitting, try 3 way additions (I-way, D-way, U-way) and 3 way removals, resulting in 6 candidate configurations
- Simulate each candidate and select the minimum-energy configuration
- If energy decreased, repeat; if energy increased, or there is no new configuration to explore, stop: the result is the selected L2 configuration
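The additive size-exploration loop can be sketched as follows; `simulate_energy` is a hypothetical callback scoring an L2 configuration represented simply as a tuple of way types, and `max_ways` is an illustrative capacity limit:

```python
def additive_way_tuning(simulate_energy, max_ways=8):
    """ACE-AWT L2 size-exploration sketch: grow an empty L2 one way at a
    time. Each step tries adding one I-way, D-way, or U-way, keeps the
    minimum-energy candidate, and stops when energy rises or the cache
    reaches its maximum size."""
    cfg = ()                                  # current L2: tuple of way types
    best_e = simulate_energy(cfg)
    while len(cfg) < max_ways:
        candidates = [cfg + (w,) for w in ("I", "D", "U")]
        e, cand = min((simulate_energy(c), c) for c in candidates)
        if e >= best_e:
            break                             # no way addition helps: done
        best_e, cfg = e, cand
    return cfg
```

The fine-tuning phase would follow the same pattern with six candidates per step (three additions and three removals).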
Results
- The heuristic achieved near-optimal results (where the optimal could be computed)
- 62% energy savings compared to the base cache, while searching only 0.2% of the search space
- Key, as in the previous heuristics: combining a proven space-pruning method (impact ordering of parameters) with architecture-specific knowledge yields highly efficient and effective results
[Figure: energy consumption normalized to the base cache configuration for A2TIME01, BaseFP01, CACHEB01, CANRDR01, IIRFLT01, MATRIX01, PUWMOD01, RSPEED01, TBLOOK01, AIFFTR01, AIIFFT01, AIFIRF01, BITMNP01, IDCTRN01, PNTRCH01, TTSPRK01, bcnt, bilv, binary, blit, brev, g3fax, matmul, pocsag, ps-jpeg, ucbqsort, v42, and the average; ACE-AWT vs. optimal]
Outline
- Develop an efficient tuning heuristic for a highly configurable two-level cache hierarchy
  - Developed in a simulation-based environment, but applicable to a dynamic tuning environment
  - 62% energy savings on average
- Current research: a feedback-control system for online cache tuning
Online Cache Tuning
- Reconfigure the cache dynamically to adapt to different phases of program execution, or to different applications in a multi-application environment
[Figure: energy consumption over time for the base cache, an application-tuned cache, and a phase-tuned cache that changes configuration at each phase change]
Online Cache Tuning Challenges
- Need a good tuning interval
  - The tuning interval is the time between invocations of the tuning hardware
  - It should closely match the phase interval, the length of time the system executes between phase changes
- Tuning interval too short: excess tuning energy
- Tuning interval too long: wasted energy running in a suboptimal configuration
[Figure: runtime energy over time versus the base cache energy, for too-short and too-long tuning intervals relative to the phase interval]
Previous Online Cache Tuning
- Largely ad hoc
  - Fixed tuning interval: inspect counters and adjust the cache
  - Search a very small configuration space (≈ 4 configurations) to limit tuning overhead
  - Adjusted tuning thresholds
- None analyzed the chosen tuning interval, and none attempted to tune the tuning interval
Periodic System
- Experiment: phase interval fixed at 10 million cycles; tuning interval swept from 1 million to 20 million cycles
- Tuning interval too short: the system is severely penalized if the phase interval is not precisely followed
- Tuning interval well matched: 32% energy savings
- Tuning interval too long: 28% energy savings, an acceptable penalty
- Goal: the tuning interval should be 1/2 of the phase interval
[Figure: online tuning energy normalized to the base cache versus tuning interval in millions of cycles]
Online Algorithms
- We need to determine the tuning interval while the system is executing
- Online algorithms process data piecemeal, unable to view the entire dataset
- The online tuner must determine the tuning interval based on current and past events, with no knowledge of the future
Feedback Control System
[Figure: classical feedback loop. The reference input r_t and the measured error feed an error detector; the controller computes the plant input u_t = F(x_t); an actuator manipulates the plant (the system under control), which is subject to disturbances; a sensor closes the loop]
- Difficulty: set-points are typically fixed values, but we want minimization of energy, which makes developing the control system much more difficult
Online Cache Tuner
- Goal: adjust the tuning interval to match the phase interval
- Observe the change in energy due to tuning, comparing energy before and after tuning
  - If there is a change, the tuning interval is too long: we missed a phase change
  - If there is no change, the tuning interval is too short
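A minimal sketch of this binary rule; the threshold and the grow/shrink factors here are illustrative assumptions, not values from the talk:

```python
def adjust_interval(interval, energy_before, energy_after,
                    change_threshold=0.05, grow=2.0, shrink=0.5):
    """Binary interval-adjustment rule: a significant energy change after
    tuning means a phase change was missed (interval too long, so shrink);
    no significant change means tuning was unnecessary (interval too
    short, so grow)."""
    change = abs(energy_after - energy_before) / max(energy_before, 1e-12)
    return interval * (shrink if change > change_threshold else grow)
```

The controller described later replaces this all-or-nothing step with a piecewise-linear response to stabilize the interval.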
Online Cache Tuner – Feedback Control System
[Figure: the feedback loop instantiated for cache tuning. The plant is the microprocessor and cache; the set-point is minimized energy; the controller activates the cache tuner (the actuator) on each tuning interval; the sensor is an energy model driven by the miss rate; the previous energy is stored so that the percent energy change %ΔE exposes phase changes]
Controller Logic
- Based on an attack/decay online algorithm
  - Increase the tuning interval slowly to avoid overshooting
  - Decrease the tuning interval quickly to avoid wasted energy
- Draws on fuzzy logic to stabilize the tuning interval: change the tuning interval based on how close or far the system is from being stable, using a 2-part equation
Controller Logic
- Let x = %ΔE (the percent energy change, averaged over the last W measurements to eliminate erratic behavior) and let y = ΔTI, the multiplicative change to the tuning interval; PoS is the point of stability, where y = 1
- Small energy change (%ΔE < PoS): the tuner runs too frequently, so increase the interval:
  y = ((1 − U) / PoS) x + U
- Large energy change (%ΔE ≥ PoS): the tuner runs too infrequently, so decrease the interval:
  y = ((D − 1) / (1 − PoS)) x + 1 − ((D − 1) / (1 − PoS)) PoS
- U, D, PoS, and W are determined through experimentation
[Figure: piecewise-linear ΔTI versus %ΔE, crossing 1.0 at the point of stability PoS]
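The two-part equation can be written directly in Python; the values of U, D, and PoS below are placeholders, since the talk determines them experimentally:

```python
def interval_scale(pct_delta_e, U=1.5, D=0.5, PoS=0.1):
    """Piecewise-linear controller: maps the (averaged) percent energy
    change %dE in [0, 1] to a multiplicative change in the tuning
    interval. Returns U at 0 (grow), 1.0 at PoS (stable), D at 1 (shrink)."""
    if pct_delta_e < PoS:
        # small change: tuning too often, increase the interval
        return (1 - U) / PoS * pct_delta_e + U
    # large change: tuning too infrequently, decrease the interval
    slope = (D - 1) / (1 - PoS)
    return slope * pct_delta_e + 1 - slope * PoS
```

Both line segments meet at (PoS, 1.0), so the interval is left unchanged when the system is at its point of stability.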
Tracking Interval Length Over Time
- The tuning interval oscillates near 1/2 of the phase interval
[Figure: phase interval and tuning interval in cycles, plotted over execution time in units of 10k cycles]
Online Cache Tuner Energy Savings
- 29% energy savings, within 8% of optimal
- We observed similar results for less periodic systems
[Figure: normalized energy for workload pairs (ps-jpeg/v42, blit/g721Dec, binary/pocsag, jpegEnc/jpegDec, bcnt/epic, pegwitDec/g3fax, fir/bilv, ucbqsort/brev, matmul/mpegDec, pegwitEnc/rawcaudio) and the average, comparing the optimal tuner, a tuner with tuning interval = 1/2 phase interval, and the variable tuning interval]
Conclusions
- Developed a very efficient cache tuning heuristic for a highly configurable cache offering 18,000 different cache configurations
  - 62% energy savings in the cache hierarchy, while searching only 0.2% of the search space
  - Key: combination of an efficient heuristic method with knowledge of architecture features
- Developed a feedback control system for online cache tuning
  - 29% energy savings on average, within 8% of optimal
  - Key: application of control theory to online cache tuning
  - Continuing work targets more random systems
Future Work
- Dynamic optimizations in a multi-core environment
  - Cache hierarchy: some levels may be shared
  - Dynamic load distribution
  - Dynamic per-core shutdown or voltage reduction for reduced power consumption
  - Many single-core optimizations can be non-trivially applied to a multi-core environment
  - Dynamic tuning enables energy savings with no extra designer effort, suitable for standard-binary situations, changing environments, etc.
- Other multi-core issues
  - Ease development for a multi-core system: the designer writes an application without specialization for multi-core, and the application is transparently mapped to a multi-core system
  - Architectural support for debugging, e.g., shared resources
Publications
Journal Papers
- Frequent Loop Detection Using Non-Intrusive On-Chip Hardware. A. Gordon-Ross, F. Vahid. IEEE Transactions on Computers, Best of the 2003 MICRO and CASES conferences special issue (Embedded Systems, Microarchitecture, and Compilation Techniques, in Memory of B. Ramakrishna (Bob) Rau), Oct. 2005, Vol. 54, Issue 10, pp. 1203-1215.
- Tiny Instruction Caches for Low Power Embedded Systems. A. Gordon-Ross, S. Cotterell, F. Vahid. ACM Transactions on Embedded Computing Systems, Vol. 2, Issue 4, Nov. 2003, pp. 449-481.
- Exploiting Fixed Programs in Embedded Systems: A Loop Cache Example. A. Gordon-Ross, S. Cotterell, F. Vahid. IEEE Computer Architecture Letters, Vol. 1, January 2002.
Conference Papers
- A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar, E. Barros. IEEE/ACM DATE, April 2007.
- Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE DAC, July 2006.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. IEEE/ACM ISLPED, August 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. ACM GLSVLSI, April 2005.
- Automatic Tuning of Two-Level Caches to Embedded Applications. A. Gordon-Ross, F. Vahid, N. Dutt. IEEE/ACM DATE, February 2004.
- Frequent Loop Detection Using Non-Intrusive On-Chip Hardware. A. Gordon-Ross, F. Vahid. IEEE/ACM CASES, October 2003.
- Dynamic Loop Caching Meets Preloaded Loop Caching -- A Hybrid Approach. A. Gordon-Ross, F. Vahid. IEEE ICCD, September 2002.
- A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power. F. Vahid, A. Gordon-Ross. IEEE/ACM ISLPED, August 2001.