Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement
Akanksha Jain, Calvin Lin
1
Cache Replacement
• On a cache miss, which line should we evict?
 – A well-studied problem
• Belady provided an optimal solution in 1966
2
Belady’s Optimal Algorithm (OPT)
• Evict the line that is accessed farthest in the future
 – Impractical, since it requires knowledge of future accesses
3
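Belady's rule can be sketched as a simple trace simulation (an illustrative Python model, not a practical policy, since it scans the entire future trace on every miss):

```python
def belady_opt(trace, capacity):
    """Count hits under Belady's OPT for a full access trace."""
    hits = 0
    cache = set()
    for i, line in enumerate(trace):
        if line in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            # Evict the cached line whose next use is farthest in the future.
            def next_use(x):
                for j in range(i + 1, len(trace)):
                    if trace[j] == x:
                        return j
                return float("inf")  # never reused again: best victim
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return hits
```

On the access stream A B C C E D D A used later in the talk, with capacity 2, this simulation yields 3 hits.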
Existing Solutions
• Rely on heuristics that make assumptions about the underlying access pattern
 – LRU assumes recency-friendly accesses
 – MRU assumes thrashing accesses
 – Recent solutions use more sophisticated heuristics
• Problem: Heuristics don’t work well when their assumptions don’t hold
4
Example: Matrix Multiplication
A = medium-term reuse, B = long-term reuse, C = short-term reuse
[Chart: hit rates of LRU, MRU, DRRIP, SDBP, SHiP, and OPT for lines B (long-term), A (medium-term), and C (short-term)]
5
6
Significant Headroom
[Chart: MPKI reduction over LRU (%) on SPEC, 0–30 scale, for DIP (2007), DRRIP (2010), DSB (2010), SDBP (2010), and SHiP (2011) versus Optimal]
7
We are going to use OPT
Our Solution: Key Idea
• We cannot look into the future
• But we can apply the OPT algorithm to past events to learn how OPT behaves
• If the past predicts the future, then this solution should approach OPT
[Diagram: a Predictor observes past behavior on a timeline to answer "What should I evict?" about the future]
8
Complication
• OPT looks arbitrarily far into the future
 – We might need an arbitrarily long history
9
How far in the future does OPT need to look?
[Chart: error compared to infinite OPT (%) versus view of the future (1× to 16× the cache size, in cache accesses), for Belady and LRU]
10
History length is not unbounded, but we need to track the past 8× the cache size in accesses
Our Solution
• Hawkeye Cache Replacement (Hawkeye)
 – Hawks can see 8× farther than the best humans
11
Hawkeye: Challenges
• Need to look at a long history (8× the cache size)
• Need to efficiently compute the OPT solution for a long history of past references
• New algorithm called OPTgen
 – Online (linear time)
 – Sampling (12KB overhead for a 2MB cache)
12
Hawkeye: Overall Design
[Diagram: the cache access stream feeds OPTgen, which computes OPT's decisions for the past; a PC-based Predictor remembers past OPT decisions. For each access, OPTgen's OPT hit/miss result and the PC train the predictor, which sets the insertion priority at the Last Level Cache]
13–14
Hawkeye: Overall Design
[Diagram: on an access to address X by a given PC, OPTgen answers "With OPT, would X be a cache hit or miss?" and trains the predictor with the result. The PC-based Predictor answers "Does this PC tend to load cache-friendly or cache-averse lines?", and X is inserted into the Last Level Cache with high or low priority accordingly]
15
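The predictor can be sketched as a table of saturating counters indexed by a hash of the load PC (the table size, hash, and counter width below are illustrative assumptions, not the paper's exact hardware design):

```python
class PCPredictor:
    """Table of 3-bit saturating counters indexed by a hash of the load PC."""

    def __init__(self, table_bits=13):
        self.mask = (1 << table_bits) - 1
        self.table = [4] * (1 << table_bits)  # counters start at the midpoint

    def _index(self, pc):
        return (pc >> 2) & self.mask  # drop instruction-alignment bits

    def train(self, pc, opt_hit):
        # OPTgen's hit/miss decision for a past access trains this PC's counter.
        i = self._index(pc)
        if opt_hit:
            self.table[i] = min(self.table[i] + 1, 7)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

    def predict_friendly(self, pc):
        # High counter value => this PC tends to load cache-friendly lines.
        return self.table[self._index(pc)] >= 4
```

A line whose PC is predicted cache-friendly would be inserted with high priority; lines from cache-averse PCs are inserted with low priority, making them early eviction candidates.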
OPTgen
• Simple online algorithm that reproduces OPT's solution for the past
• Inspired by Belady's insight
 – Lines that are reused first have higher priority
• OPTgen also gives higher priority to lines that are reused first
16
OPTgen
• For each line, we ask whether it would have been a hit or a miss with OPT
Access stream (cache capacity = 2): A B C C E D D A
17
OPTgen
• For each line, we ask whether it would have been a hit or a miss with OPT
Access stream (cache capacity = 2): A B C C E D D A
The reuse of A is a hit with OPT.
18
OPTgen
• For each line, we ask whether it would have been a hit or a miss with OPT
Access stream (cache capacity = 2): A B C C E D D A B
The reuse of B is a miss with OPT.
19
OPTgen is equivalent to OPT
• 100% accuracy with unlimited history
 – 95.5% accuracy with 8× history (for SPEC 2006)
20
OPTgen Algorithm: Insight 1
• OPT hit/miss can be determined at the time of reuse
 – We don't need information about future accesses, because they will have lower priority than A's reuse
Access stream (cache capacity = 2): A B C C E D D A B
21
OPTgen Algorithm: Insight 2
• OPT's solution can be reproduced by tracking occupancy rather than cache contents
22
OPTgen Algorithm: Insight 2
Access stream (cache capacity = 2): A B C C E D D A B
Occupancy vector after A's reuse: 1 1 2 1 1 2 1
23
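The occupancy-vector idea can be sketched in Python (a simplified software model, not the hardware implementation; it keeps the whole trace in memory, whereas the real OPTgen works over a bounded history):

```python
def optgen(trace, capacity):
    """Reproduce OPT's hit/miss decisions by tracking per-time-step occupancy."""
    occupancy = [0] * len(trace)   # cache demand at each time step
    last_use = {}                  # line -> time of its previous access
    results = []
    for t, line in enumerate(trace):
        if line in last_use:
            start = last_use[line]
            # A reuse is a hit under OPT iff the cache has a free slot over
            # the line's entire usage interval [start, t).
            if all(occupancy[i] < capacity for i in range(start, t)):
                for i in range(start, t):
                    occupancy[i] += 1
                results.append((line, "hit"))
            else:
                results.append((line, "miss"))
        else:
            results.append((line, "miss"))  # first touch: compulsory miss
        last_use[line] = t
    return results
```

On the slide's stream A B C C E D D A B with capacity 2, this model reports the reuses of C, D, and A as hits and B's reuse as a miss, and after A's hit the occupancy vector is exactly the 1 1 2 1 1 2 1 shown above.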
OPTgen Algorithm: Insight 3
• OPT considers both reuse distances and their overlap
Access stream (cache capacity = 2): A B C C E D D A E
E's reuse distance is smaller than A's, but E misses with OPT because its interval has higher overlap
24
Sampling for OPTgen
• 8× history for a 2MB cache
 – 100K entries (> 0.5MB)
• Set Dueling [Qureshi et al., ISCA '07]
 – Sample the behavior of a few cache sets
 – 64 sampled sets
• Set Dueling with OPTgen
 – Apply OPTgen to 64 sampled sets
 – 2.5K entries (12KB)
 – 95% accuracy in estimating miss rate
25
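The sampling idea can be sketched as follows (the set count and the evenly spaced selection below are illustrative assumptions; the sampled sets need not be chosen this way in the actual design):

```python
# 2MB cache, 16 ways, 64B lines -> 2048 sets; sample 64 of them so that
# only accesses mapping to sampled sets feed OPTgen's history.
NUM_SETS = 2048
SAMPLED = set(range(0, NUM_SETS, NUM_SETS // 64))  # 64 evenly spaced sets

def is_sampled(addr, line_size=64):
    """Return True if this address maps to one of the sampled cache sets."""
    set_index = (addr // line_size) % NUM_SETS
    return set_index in SAMPLED
```

Because only 64 of 2048 sets are tracked, OPTgen's storage shrinks from over 100K entries to roughly 2.5K, while the sampled sets still estimate the overall miss rate closely.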
Hawkeye: Overall Design
[Diagram: OPTgen, now 95% accurate with sampling, and the PC-based Predictor drive insertion priority at the Last Level Cache from the cache access stream]
26
Evaluation
• Cache Replacement Championship (CRC) simulator
• Benchmarks: memory-intensive SPEC 2006
• Hardware configurations
 – Single-core: 16-way, 2MB LLC
 – Multi-core: shared 16-way, 4MB & 8MB LLC
• Replacement policies
 – Baseline: LRU
 – DRRIP [2010]
 – SDBP [2011]
 – SHiP [2012]
27
Miss Reduction
[Chart: miss reduction over LRU (%), −30 to 60 scale, for DRRIP, SHiP, and SDBP across the SPEC benchmarks (astar, gromacs, gobmk, lbm, omnetpp, calculix, gcc, leslie, zeus, libq, bzip, h264, tonto, hmmer, xalanc, mcf, soplex, gems, cactus, sphinx3) and their mean]
28
Miss Reduction
[Chart: per-benchmark miss reduction over LRU (%) for DRRIP, SHiP, SDBP, and Hawkeye; Hawkeye's mean miss reduction is 17.4%]
Hawkeye outperforms DRRIP, SDBP, and SHiP.
29
Miss Reduction
[Chart: per-benchmark miss reduction over LRU (%) for DRRIP, SHiP, SDBP, and Hawkeye; Hawkeye's mean miss reduction is 17.4%]
Hawkeye does not result in a slowdown for any benchmark.
30
Miss Reduction
[Chart: per-benchmark miss reduction over LRU (%) for DRRIP, SHiP, SDBP, and Hawkeye; Hawkeye's mean miss reduction is 17.4%]
Hawkeye's performance gains are consistent across benchmarks.
31
Miss Reduction
[Chart: per-benchmark miss reduction over LRU (%), −30 to 70 scale, for DRRIP, SHiP, SDBP, Hawkeye, and OPT]
32
Speedup
[Chart: per-benchmark speedup over LRU (%) for DRRIP, SHiP, SDBP, and Hawkeye; Hawkeye's geomean speedup is 8.4%]
33
Multi-Core Results
[Chart: MPKI reduction over LRU (%) for SDBP, SHiP, and Hawkeye on 1, 2, and 4 cores; Hawkeye's speedup is 8.4% on 1 core, 13.5% on 2 cores, and 15.0% on 4 cores]
Averaged across 100s of multi-programmed SPEC runs
34
Hawkeye: Summary
• New goal: learn from the OPT solution
• Not limited to specific access patterns
• Models both reuse distance and demand
[Chart: MPKI reduction over LRU (%) for DIP ('07), DRRIP ('10), DSB ('10), SDBP ('10), SHiP ('11), OPT, and Hawkeye (2015)]
35
Conclusions and Future Work
• Recent trend: view cache replacement as a prediction problem
 – Learn from the past to predict the future
• Hawkeye learns from an oracle
• Future work: more sophisticated predictors to learn the OPT solution for the past
36