Predictor-Directed Stream Buffers
Timothy Sherwood
Suleyman Sair
Brad Calder
Sherwood, Sair, and Calder 2
Overview
• Introduction
• Past Stream Buffer work
• Predictor-Directed Stream Buffers
• Policy Improvements
• Results
• Contribution
Introduction
• Memory Wall
• Latency reduction through prefetching
– without consuming too much bandwidth
• Stream Buffers are one of the most widely used prefetchers
– simple to implement
– very efficient
• Pointer-based codes
Past Stream Buffer work
• Jouppi 1990
– consecutive cache line FIFO
• Palacharla and Kessler 1994
– non-unit stride (based on memory chunk)
– allocation filters
• Farkas et al. 1997
– PC-based stride
– fully associative / non-overlapping
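For reference, a Jouppi-style sequential stream buffer can be sketched in a few lines of Python. The class, block size, and head-only comparison below are our own illustration of the idea, not code from any of the cited papers:

```python
# Sketch of a Jouppi-style sequential stream buffer (illustrative names).
from collections import deque

BLOCK = 64  # assumed cache-block size in bytes

class StreamBuffer:
    def __init__(self, depth=4):
        self.entries = deque(maxlen=depth)  # FIFO of prefetched block addresses
        self.next_addr = None

    def allocate(self, miss_addr):
        """On an L1 miss, start prefetching the consecutive blocks after it."""
        self.entries.clear()
        self.next_addr = (miss_addr // BLOCK + 1) * BLOCK
        self.refill()

    def refill(self):
        # Model each appended address as an issued prefetch.
        while len(self.entries) < self.entries.maxlen:
            self.entries.append(self.next_addr)
            self.next_addr += BLOCK

    def lookup(self, addr):
        """Original FIFO design: only the head entry is compared."""
        block = addr // BLOCK * BLOCK
        if self.entries and self.entries[0] == block:
            self.entries.popleft()
            self.refill()
            return True
        return False

sb = StreamBuffer()
sb.allocate(0x1000)
hits = [sb.lookup(0x1040), sb.lookup(0x1080), sb.lookup(0x2000)]  # [True, True, False]
```

A sequential walk hits in the buffer; the out-of-stream access misses, which is exactly the behavior that breaks down on pointer codes.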
Past Stream Buffer work
[Diagram: stride-filtered stream buffers – N buffers of (tag, cache block, comparator) entries with PredictedStride and LastAddress registers; the predicted stride is stored in the streaming buffer on allocation; buffers sit between the next lower level of memory and the data cache, register file, and MSHRs.]
Past Stream Buffer work
• Past work targeted streaming through arrays
– either in sequential order
– or in stride order (multidimensional arrays)
• Could not handle pointer codes
– repetitive, non-striding references
• Need a more general predictor
Predictor-Directed Stream Buffers
• The Goal: Simple and efficient hardware based prefetching of complex but predictable streams
• Approach: Take a general predictor and hook it up to the well-established stream buffer front end
• Separate the predictor from the prefetcher
• Can use almost any predictor
– 2-delta
– Context
– Markov
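The decoupling can be sketched as follows: the stream buffer front end only ever calls a `predict` method, so a stride predictor, a Markov predictor, or anything else plugs in unchanged. All names below are illustrative, assuming byte addresses:

```python
# Sketch of predictor/prefetcher decoupling (illustrative classes).

class StridePredictor:
    def __init__(self):
        self.last, self.stride = None, 0
    def train(self, addr):
        if self.last is not None:
            self.stride = addr - self.last
        self.last = addr
    def predict(self, addr):
        return addr + self.stride

class MarkovPredictor:
    def __init__(self):
        self.table, self.last = {}, None
    def train(self, addr):
        if self.last is not None:
            self.table[self.last] = addr  # remember the observed successor
        self.last = addr
    def predict(self, addr):
        return self.table.get(addr)       # None when no transition is known

def fill_buffer(predictor, start, depth=4):
    """Front end: chain predictions to generate a prefetch stream."""
    stream, addr = [], start
    for _ in range(depth):
        addr = predictor.predict(addr)
        if addr is None:
            break
        stream.append(addr)
    return stream

sp = StridePredictor()
for a in (100, 116):
    sp.train(a)
stream = fill_buffer(sp, 116)        # [132, 148, 164, 180]

mp = MarkovPredictor()
for a in (0x10, 0x50, 0x10, 0x50):
    mp.train(a)
ring = fill_buffer(mp, 0x10)         # [0x50, 0x10, 0x50, 0x10]
```

The same front end follows an arithmetic stride or a repeating pointer chain, depending only on which predictor is attached.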
PSB Generalized Architecture
[Diagram: PSB generalized architecture – an AddressPredictor table indexed by load PC (history, stride, confidence, last address, prediction info) receives load info (PC, address) from the write-back stage; predicted addresses and a subset of the prediction info go to N stream buffers of (tag, cache block, comparator) entries, which connect to the next lower level of memory, feed the data cache, register file, and MSHRs, and update the prediction information.]
PSB Stages
• Allocation
• Prediction
• Probe
• Prefetching
• Lookup
Stage Descriptions
• Allocation
– a stream buffer is allocated to a particular load
– the buffer is initialized
– subject to allocation filters
• Prediction
– an empty buffer entry asks for an address
– subject to limited predictor speed
Stage Descriptions (Continued)
• Probe
– if there are free ports, remove useless prefetches
– not mandatory
• Prefetching
– subject to scheduling for ports and priority, prefetches are sent to memory
• Lookup
– when a load performs an L1 access, the stream buffers are checked in parallel
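A heavily simplified, single-buffer sketch of the prediction, probe, and prefetch stages; allocation filters, lookup, and port scheduling are omitted, and the function and field names are ours:

```python
# Sketch of one PSB cycle for a single buffer (simplified; names are ours).

def psb_step(buffer, predictor, in_cache, ports_free):
    # Prediction: an empty slot asks the predictor for the next address.
    if len(buffer["pending"]) < buffer["depth"]:
        nxt = predictor(buffer["last"])
        buffer["last"] = nxt
        buffer["pending"].append(nxt)
    # Probe: with a spare cache port, drop prefetches already in the L1.
    if ports_free:
        buffer["pending"] = [a for a in buffer["pending"] if a not in in_cache]
    # Prefetching: issue whatever survives to memory (modeled as the return value).
    issued, buffer["pending"] = buffer["pending"], []
    return issued

buf = {"depth": 2, "last": 0x100, "pending": []}
first = psb_step(buf, lambda a: a + 0x40, in_cache={0x140}, ports_free=True)
second = psb_step(buf, lambda a: a + 0x40, in_cache={0x140}, ports_free=True)
# first == [] (0x140 was probed away), second == [0x180]
```

The probe stage is what saves bandwidth here: the prefetch of 0x140 is squashed because the block is already cached, so only 0x180 reaches memory.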
PSB Implementation
• Tried many different address predictors
• Best is Stride-Filtered Markov
– similar to Joseph and Grunwald’s predictor
– first-order Markov
– striding behavior is filtered out
• Differences are stored instead of full addresses to reduce table size
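A minimal sketch of the stride-filtering idea, assuming a single global stream for brevity (the real predictor is a PC-indexed table): deltas that merely repeat the current stride never enter the Markov table, and each entry holds a difference rather than a full address.

```python
# Sketch of stride-filtered Markov training (simplified, single stream).

class SFM:
    def __init__(self):
        self.last, self.stride = None, None
        self.markov = {}                 # address -> stored difference

    def train(self, addr):
        if self.last is not None:
            delta = addr - self.last
            if delta != self.stride:
                # Non-striding transition: record it as a difference.
                self.markov[self.last] = delta
                self.stride = delta
            # else: striding behavior, filtered out of the Markov table
        self.last = addr

    def predict(self, addr):
        if addr in self.markov:
            return addr + self.markov[addr]   # Markov hit wins
        return addr + (self.stride or 0)      # otherwise fall back to stride

sfm = SFM()
for a in (100, 104, 108, 112, 500):
    sfm.train(a)
nxt = sfm.predict(112)   # 500: the stored difference 388 recreates the jump
```

The strided run 100..112 collapses to a single Markov entry plus the stride, while the pointer-like jump to 500 is captured exactly.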
Difference Storing
[Graph: percent of L1 misses covered (0–100%) vs. number of bits stored (1–19) for burg, deltablue, gs, sis, turb3d, and health.]
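To illustrate the compression the graph measures: the width needed to store a difference is just the bit length of the delta plus a sign bit, which for typical miss-to-miss deltas is far below a full 32-bit address field (the helper below is ours):

```python
# Why difference storing shrinks Markov entries (illustrative helper).

def delta_bits(delta):
    """Signed bits needed to store a miss-to-miss difference."""
    return abs(delta).bit_length() + 1   # +1 for the sign bit

widths = [delta_bits(d) for d in (64, -64, 4096)]  # [8, 8, 14]
FULL_ADDRESS = 32  # assumed full address width, for comparison
```

A table entry holding a small signed difference can be several times narrower than one holding the full next address, which is what lets a 2K-entry table stay practical.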
PSB with SFM
[Diagram: PSB with Stride-Filtered Markov – load info (PC, address) from the write-back stage feeds a StridePredictor (PredictedStride, LastAddress) and a MarkovPredictor (indexed by last address; on a hit it returns a predicted address); a MUX selects the Markov address on a Markov hit, else the stride address; the predicted stride is stored in the streaming buffer on allocation; 8 buffers of (tag, cache block, comparator) entries connect to the next lower level of memory and feed the data cache, register file, and MSHRs.]
Methods
• SimpleScalar 3.0
• Rewrote the memory hierarchy
• Model bandwidth between all levels
• Added perfect store sets
• Ran over a set of pointer benchmarks
• 2K-entry predictor table
• 8 stream buffers × 4 entries each
• 32KB 4-way set-associative cache
Speedup from PSB
[Graph: percent speedup (0–45%) on health, burg, deltablue, gs, sis, and turb3d for PC-Stride vs. PSB w/ SFM.]
Allocation Filtering
• Farkas et al. showed that two-miss filtering
– prevents too many streams from requesting resources
• Does not work as well for pointer codes
– irregular miss patterns
• We use priority and accuracy counters
– track the behavior of loads
– allocate to loads that are behaving well
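One way to realize accuracy-counter filtering; the counter width and threshold below are chosen arbitrarily for illustration, not taken from the paper:

```python
# Sketch of accuracy-based allocation filtering (illustrative parameters).
from collections import defaultdict

ACC_MAX, THRESH = 7, 2           # assumed 3-bit saturating counter, assumed threshold

accuracy = defaultdict(int)      # load PC -> saturating accuracy counter

def record(pc, useful):
    """Raise the counter on a useful prefetch, lower it on a useless one."""
    if useful:
        accuracy[pc] = min(accuracy[pc] + 1, ACC_MAX)
    else:
        accuracy[pc] = max(accuracy[pc] - 1, 0)

def may_allocate(pc):
    """Only well-behaving loads get to claim a stream buffer."""
    return accuracy[pc] >= THRESH

record(0x400, True); record(0x400, True)   # a load that prefetches well
record(0x800, False)                       # a load that wastes prefetches
allowed = (may_allocate(0x400), may_allocate(0x800))  # (True, False)
```

Unlike two-miss filtering, this keys on demonstrated prefetch accuracy rather than on a miss pattern, so irregular pointer loads that predict well still get buffers.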
Allocation Filtering Speedup
[Graph: percent speedup (0–45%) on health, burg, deltablue, gs, sis, and turb3d for PC-Stride, 2 Miss, and ConfAlloc.]
Stream Buffer Priority
• Round Robin
– gives each active buffer equal resources
– for both prediction and prefetching
• Priority Counters
– a small counter with each buffer
– use the counters to rank the buffers
– give more resources to better-performing buffers
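Priority scheduling can be sketched as sorting buffers by their counters before handing out prediction and prefetch slots; this is a software simplification of the hardware, with names of our choosing:

```python
# Sketch of priority scheduling across stream buffers (illustrative).

def schedule(buffers, slots):
    """Give the `slots` available prefetch slots to the highest-priority buffers."""
    order = sorted(buffers, key=lambda b: b["priority"], reverse=True)
    return [b["id"] for b in order[:slots]]

bufs = [
    {"id": 0, "priority": 1},   # rarely hitting buffer
    {"id": 1, "priority": 5},   # hottest stream
    {"id": 2, "priority": 3},
]
winners = schedule(bufs, 2)     # [1, 2]: round-robin would have picked [0, 1]
```

With only two slots, the two best-performing buffers win them, instead of each buffer getting an equal share regardless of how useful its stream has been.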
Priority Scheduling Speedup
[Graph: percent speedup (0–45%) on health, burg, deltablue, gs, sis, and turb3d for PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, and ConfAlloc-Priority.]
Latency Reduction
[Graph: average access latency in cycles (0–12) on health, burg, deltablue, gs, sis, and turb3d for Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, and ConfAlloc-Priority.]
Contributions
• Predictor-Directed Stream Buffers allow decoupling of the stream buffer front end from address generation
• Accuracy-based allocation filtering and priority scheduling can make a large difference in performance
• With simple compression, even small Markov tables can be very effective
Accuracy
[Graph: percent accuracy (0–90%) on health, burg, deltablue, gs, sis, and turb3d for PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, and ConfAlloc-Priority.]
Bus Results
[Graph: L1-to-L2 bus utilization (0–100%) and L2-to-memory bus utilization (0–20%) on health, burg, deltablue, gs, sis, and turb3d for Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, and Conf-Pri.]