Prefetch-Aware DRAM Controllers. Chang Joo Lee, Onur Mutlu*, Veynu Narasiman, Yale N. Patt. Electrical and Computer Engineering, The University of Texas at Austin. *Microsoft Research and Carnegie Mellon University.
1
Prefetch-Aware DRAM Controllers
Chang Joo Lee
Onur Mutlu*
Veynu Narasiman
Yale N. Patt
Electrical and Computer Engineering
The University of Texas at Austin
*Microsoft Research and Carnegie Mellon University
2
Outline
Motivation
Mechanism
Experimental Evaluation
Conclusion
3
Modern DRAM Systems
Rows and columns of DRAM cells
A row buffer in each bank
Non-uniform access latency:
  Row-hit: Data is in the row buffer
  Row-conflict: Data is not in the row buffer; needs to access the DRAM cells
  Row-hit latency << Row-conflict latency
Prioritize row-hit accesses to increase DRAM throughput [Rixner et al., ISCA 2000]
[Figure: A DRAM bank with its row buffer and data bus. With Row A in the row buffer, a processor access to Row A is a row-hit; a following processor access to Row B is a row-conflict.]
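The row-hit versus row-conflict distinction can be illustrated with a toy model of a single bank. This is only a sketch: the class name `DramBank` and the latency values 1 and 3 are hypothetical placeholders chosen to show that a row-hit is cheaper than a row-conflict, not real DRAM timings. An access to a bank with no open row is treated as a row-conflict for simplicity.

```python
class DramBank:
    """Toy model of one DRAM bank with a single row buffer (illustrative only)."""

    def __init__(self, row_hit_latency=1, row_conflict_latency=3):
        # Latencies are made-up units; only their ordering matters here.
        self.open_row = None
        self.row_hit_latency = row_hit_latency
        self.row_conflict_latency = row_conflict_latency

    def access(self, row):
        """Return (outcome, latency) and update the row buffer."""
        if row == self.open_row:
            # Data is already in the row buffer.
            return "row-hit", self.row_hit_latency
        # Data is not in the row buffer: the DRAM cells must be accessed
        # and the requested row replaces the one in the row buffer.
        self.open_row = row
        return "row-conflict", self.row_conflict_latency


bank = DramBank()
bank.access("A")           # brings Row A into the row buffer
first = bank.access("A")   # same row again
second = bank.access("B")  # different row
print(first, second)
```

With Row A open, the second access to Row A is reported as a row-hit and the access to Row B as a row-conflict, mirroring the two cases on the slide.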
4
Problems of Prefetch Handling
How to schedule prefetches vs. demands?
  Demand-first: Always prioritizes demands over prefetch requests
  Demand-prefetch-equal: Always treats them the same
Neither takes into account both:
  1. Non-uniform access latency of DRAM systems
  2. Usefulness of prefetches
Neither policy performs best in every case
5
When Prefetches are Useful
[Figure: The DRAM controller holds three requests: Pref Row A: X, Dem Row B: Y, Pref Row A: Z. Row A is in the row buffer. Under demand-first, DRAM services Y first (row-conflict, opens Row B), then X (row-conflict), then Z (row-hit). The processor stalls execution on Miss Y, Miss X, and Miss Z.]
Processor needs Y, X, and Z
Demand-first: 2 row-conflicts, 1 row-hit
6
When Prefetches are Useful
[Figure: Same three requests (Pref Row A: X, Dem Row B: Y, Pref Row A: Z) with Row A in the row buffer. Two timelines are compared. Demand-first: Miss Y, Miss X, Miss Z. Demand-pref-equal: X and Z are serviced as row-hits before Y, so the processor sees Miss Y, Hit X, Hit Z and the execution stall ends earlier (saved cycles).]
Processor needs Y, X, and Z
Demand-first: 2 row-conflicts, 1 row-hit
Demand-pref-equal: 2 row-hits, 1 row-conflict
Demand-pref-equal outperforms demand-first
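The two service orders in this example can be reproduced with a small scheduling sketch. It assumes that demand-first ranks demands above prefetches, then row-hits, then the oldest request, and that demand-pref-equal ranks row-hits first, then the oldest request. The names `Request` and `schedule` are hypothetical, and the model covers a single bank only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    row: str
    is_prefetch: bool


def schedule(requests, open_row, demand_first):
    """Return [(name, outcome), ...] in service order for one bank."""
    pending = list(requests)  # arrival order, oldest first
    served = []
    while pending:
        def priority(item):
            index, req = item
            row_hit = req.row == open_row
            if demand_first:
                # Assumed order: demand over prefetch, then row-hit, then oldest.
                return (req.is_prefetch, not row_hit, index)
            # Assumed order for demand-pref-equal: row-hit, then oldest.
            return (not row_hit, index)

        index, req = min(enumerate(pending), key=priority)
        pending.pop(index)
        outcome = "row-hit" if req.row == open_row else "row-conflict"
        served.append((req.name, outcome))
        open_row = req.row  # the serviced row is now in the row buffer
    return served


# The three requests from the slide, with Row A initially in the row buffer.
example = [
    Request("X", "A", is_prefetch=True),
    Request("Y", "B", is_prefetch=False),
    Request("Z", "A", is_prefetch=True),
]
demand_first_order = schedule(example, open_row="A", demand_first=True)
equal_order = schedule(example, open_row="A", demand_first=False)
print(demand_first_order)
print(equal_order)
```

Under these assumptions, demand-first services Y, X, Z with two row-conflicts and one row-hit, while demand-pref-equal services X, Z, Y with two row-hits and one row-conflict.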
7
When Prefetches are Useless
[Figure: Same three requests (Pref Row A: X, Dem Row B: Y, Pref Row A: Z) with Row A in the row buffer. Demand-first services the requests in the order Y, X, Z; demand-pref-equal services them in the order X, Z, Y. Miss Y is resolved earlier under demand-first (saved cycles).]
Processor needs ONLY Y
Demand-first outperforms demand-pref-equal
8
Demand-first vs. Demand-pref-equal policy
[Figure: Bar chart, stream prefetcher enabled. Y-axis: IPC normalized to no prefetching (0 to 3). Benchmarks: galgel, ammp, art, milc, swim, libquantum, bwaves, leslie3d. Series: Demand-first, Demand-pref-equal. Annotations mark the benchmarks where demand-first is better and those where demand-pref-equal is better.]
Goal 1: Adaptively schedule prefetches based on prefetch usefulness
Goal 2: Eliminate useless prefetches
Useless prefetches:
  Off-chip bandwidth
  Queue resources
  Cache pollution
9
Goals
1. Maximize the benefits of prefetching:
   Increase DRAM throughput by adaptively scheduling requests based on prefetch usefulness
   → increase timeliness of useful prefetches
2. Minimize the harm of prefetching:
   Adaptively delay the service of useless prefetches and remove useless prefetches
   → increase efficiency of resource utilization
Achieve higher performance and efficiency
10
Outline
Motivation
Mechanism (current section)
Experimental Evaluation
Conclusion
11
Prefetch-Aware DRAM Controllers (PADC)
Adaptive Prefetch Scheduling (APS): Prioritizes prefetch and demand requests based on prefetch accuracy estimation
Adaptive Prefetch Dropping (APD): Cancels likely-useless prefetches from the memory request buffer based on prefetch accuracy
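[Figure: PADC block diagram. PADC contains APS and APD. Inputs: prefetch accuracy from each core and request info from the memory request buffer. APS updates request priority in the memory request buffer; APD drops requests from it. The memory request buffer issues requests to DRAM.]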
12
Prefetch Accuracy Estimation
$\text{Prefetch accuracy} = \dfrac{\#\text{Prefetches used}}{\#\text{Prefetches sent}}$
Hardware support:
  Prefetch bit (per L2 cache line, MSHR entry): Indicates whether it is a prefetch or demand
  Prefetch sent counter (per core)
  Prefetch used counter (per core)
  Prefetch accuracy register (per core)
Estimated every 100K cycles
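A per-core estimator built from the two counters and the accuracy register might look like the sketch below. The class and method names are hypothetical. Two behaviors are assumptions not stated on the slide: the counters restart at the end of each interval, and the register keeps its previous value when no prefetch was sent during the interval.

```python
class PrefetchAccuracyEstimator:
    """Per-core prefetch accuracy estimate (illustrative sketch)."""

    INTERVAL_CYCLES = 100_000  # from the slide: estimated every 100K cycles

    def __init__(self):
        self.sent = 0        # prefetch sent counter
        self.used = 0        # prefetch used counter
        self.accuracy = 0.0  # prefetch accuracy register (initial value assumed)

    def on_prefetch_sent(self):
        self.sent += 1

    def on_prefetch_used(self):
        # Called when a demand access hits a line whose prefetch bit is set.
        self.used += 1

    def end_of_interval(self):
        """Update the accuracy register; intended to run once per interval."""
        if self.sent > 0:
            self.accuracy = self.used / self.sent
        # Assumption: counters restart every interval.
        self.sent = 0
        self.used = 0
        return self.accuracy


estimator = PrefetchAccuracyEstimator()
for _ in range(8):
    estimator.on_prefetch_sent()
for _ in range(6):
    estimator.on_prefetch_used()
first_interval = estimator.end_of_interval()   # 6 used out of 8 sent
second_interval = estimator.end_of_interval()  # nothing sent: value is kept
print(first_interval, second_interval)
```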
13
Adaptive Prefetch Scheduling (APS)
1. Adaptively change the priority of prefetch requests
   Low prefetch accuracy → prioritize demands from the core
   High prefetch accuracy → treat demands and prefetches equally
2. In a CMP system: prioritize demand requests from a core that has many useless prefetches
   To avoid starving demand requests from a core with low prefetch accuracy → improves system performance
14
Adaptive Prefetch Scheduling (APS)
1. Critical requests
   All demand requests
   Prefetch requests from cores whose prefetch accuracy ≥ promotion threshold
2. Urgent requests
   Demand requests from cores whose prefetch accuracy < promotion threshold
15
Adaptive Prefetch Scheduling (APS)
Prioritization order:
1. Critical request (C)
2. Row-hit request (RH)
3. Urgent request (U)
4. Oldest request (FCFS)
Each memory request buffer entry: priority fields C | RH | U | FCFS
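The critical/urgent classification and the C, RH, U, FCFS order can be sketched as a sort key. The names `BufferEntry`, `priority_key`, and `pick_next` are hypothetical, and the 0.85 promotion threshold is taken from the PADC configuration given in the evaluation section. Real hardware would compare the priority fields of each entry rather than sort a list.

```python
from dataclasses import dataclass

PROMOTION_THRESHOLD = 0.85  # 85%, from the PADC configuration


@dataclass
class BufferEntry:
    name: str
    core: int
    is_prefetch: bool
    row_hit: bool
    arrival: int  # smaller value means older request


def priority_key(entry, accuracy_by_core):
    """Smaller key means higher priority."""
    accurate = accuracy_by_core[entry.core] >= PROMOTION_THRESHOLD
    # Critical: all demands, plus prefetches from cores with accurate prefetching.
    critical = (not entry.is_prefetch) or accurate
    # Urgent: demands from cores whose prefetch accuracy is below the threshold.
    urgent = (not entry.is_prefetch) and not accurate
    # Tuples compare left to right, which encodes C > RH > U > FCFS.
    return (not critical, not entry.row_hit, not urgent, entry.arrival)


def pick_next(entries, accuracy_by_core):
    return min(entries, key=lambda e: priority_key(e, accuracy_by_core))


accuracy = {0: 0.90, 1: 0.20}  # core 0 prefetches accurately, core 1 does not
entries = [
    BufferEntry("d0", core=0, is_prefetch=False, row_hit=False, arrival=0),
    BufferEntry("p1", core=1, is_prefetch=True, row_hit=True, arrival=1),
    BufferEntry("d1", core=1, is_prefetch=False, row_hit=False, arrival=2),
    BufferEntry("p0", core=0, is_prefetch=True, row_hit=True, arrival=3),
]
order = [e.name for e in sorted(entries, key=lambda e: priority_key(e, accuracy))]
print(order)
```

In this example the accurate core's row-hit prefetch `p0` is served first, the urgent demand `d1` overtakes the older demand `d0`, and the inaccurate core's prefetch `p1` goes last even though it is a row-hit.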
16
Adaptive Prefetch Dropping (APD)
Proactively drops old prefetches based on prefetch accuracy estimation
Old requests are likely useless
  APS prioritizes demand requests when prefetch accuracy is low
  A prefetch that is hit by a demand is promoted to a demand
Dropping old, useless prefetches saves resources (bandwidth, queues, caches)
  Saved resources can be used by useful requests
17
Adaptive Prefetch Dropping (APD)Adaptive Prefetch Dropping (APD)
Prefetch bit (P)Prefetch bit (P) Core ID field (ID)Core ID field (ID) Age field (AGE) Age field (AGE)
Drop prefetch requests whoseDrop prefetch requests whoseAGE > Drop thresholdAGE > Drop threshold
Drop threshold is dynamically determined based on Drop threshold is dynamically determined based on prefetch accuracy estimationprefetch accuracy estimation Lower accuracy Lower accuracy →→ Lower threshold Lower threshold
P ID AGE
Each memory request buffer entry: drop informationEach memory request buffer entry: drop information
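The drop rule can be sketched as a filter over the memory request buffer. The names `Entry` and `apply_prefetch_dropping` are hypothetical, and the per-core thresholds are passed in as a plain dictionary for illustration. Demand requests are never dropped, and a prefetch is dropped only when its age strictly exceeds its core's threshold.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    is_prefetch: bool  # P
    core_id: int       # ID
    age: int           # AGE, in the same unit as the drop threshold


def apply_prefetch_dropping(buffer, drop_threshold_by_core):
    """Split the buffer into (kept, dropped) according to AGE > drop threshold."""
    kept, dropped = [], []
    for entry in buffer:
        too_old = entry.age > drop_threshold_by_core[entry.core_id]
        if entry.is_prefetch and too_old:
            dropped.append(entry)
        else:
            kept.append(entry)
    return kept, dropped


# Hypothetical thresholds: core 0 has low accuracy, core 1 has high accuracy.
thresholds = {0: 100, 1: 100_000}
buffer = [
    Entry(is_prefetch=True, core_id=0, age=150),    # old prefetch, low-accuracy core
    Entry(is_prefetch=True, core_id=0, age=100),    # at the threshold, not above it
    Entry(is_prefetch=False, core_id=0, age=5000),  # demand: never dropped
    Entry(is_prefetch=True, core_id=1, age=5000),   # high-accuracy core: kept
]
kept, dropped = apply_prefetch_dropping(buffer, thresholds)
print(len(kept), len(dropped))
```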
18
Hardware Cost for 4-core CMP
Component                       Cost (bits)
Prefetch Accuracy Estimation    33,056
APS                             128
APD                             1,536
Total                           34,720
Total storage: 34,720 bits (~4.25KB) are needed
  ~4KB are prefetch bits in each cache line
  If prefetch bits are already implemented: ~228B
Logic is not on the critical path
  Scheduling and dropping decisions are made every DRAM bus cycle
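As a quick arithmetic check on the storage total, the three components can be summed and converted to bytes. The snippet assumes 8 bits per byte and 1 KB = 1024 bytes; the variable names are illustrative.

```python
cost_bits = {
    "Prefetch Accuracy Estimation": 33_056,
    "APS": 128,
    "APD": 1_536,
}

total_bits = sum(cost_bits.values())
total_bytes = total_bits / 8
total_kb = total_bytes / 1024  # assuming 1 KB = 1024 bytes

print(total_bits, total_bytes, round(total_kb, 2))
```

The sum is 34,720 bits, or 4,340 bytes, which is about 4.24 KB under this convention and consistent with the ~4.25KB quoted on the slide.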
19
Outline
Motivation
Mechanism
Experimental Evaluation (current section)
Conclusion
20
Simulation Methodology
x86 cycle-accurate simulator
Baseline processor configuration
  Per core
    4-wide issue, out-of-order, 256-entry ROB
    512KB, 8-way unified L2 cache (1MB for single core processor)
    Stream prefetcher (Lookahead, prefetch degree: 4, prefetch distance: 64)
  Shared
    On-chip, demand-first FR-FCFS memory controller
    64, 128, 256 L2 MSHRs, memory request buffer for 1-, 4-, 8-core
    DDR3 1333, 15-15-15ns, 4KB row buffer
PADC configuration
  Promotion threshold: 85%
  Drop threshold:
    Prefetch accuracy (%)      0~10    10~30    30~70     70~100
    Threshold (core cycles)    100     1,500    50,000    100,000
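The drop threshold table can be read as a lookup from estimated prefetch accuracy to a threshold in core cycles. The sketch below assumes each range is half-open, so an accuracy of exactly 10% falls into the 10~30 range; the slide does not say which side of a boundary is inclusive. The names `DROP_THRESHOLDS` and `drop_threshold` are hypothetical.

```python
# (upper bound of the accuracy range, drop threshold in core cycles)
DROP_THRESHOLDS = [
    (0.10, 100),
    (0.30, 1_500),
    (0.70, 50_000),
    (1.00, 100_000),
]


def drop_threshold(accuracy):
    """Map a prefetch accuracy in [0, 1] to a drop threshold in core cycles."""
    for upper, threshold in DROP_THRESHOLDS:
        if accuracy < upper:  # assumption: ranges are [lower, upper)
            return threshold
    # An accuracy of exactly 1.0 belongs to the last range.
    return DROP_THRESHOLDS[-1][1]


print([drop_threshold(a) for a in (0.05, 0.10, 0.50, 0.85, 1.0)])
```

Lower accuracy maps to a lower threshold, so prefetches from an inaccurate core are dropped after waiting a much shorter time.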
21
Workloads for Evaluation
Single-core processor: All 55 SPEC 2000/2006 benchmarks
  Single-threaded
  38 prefetch sensitive benchmarks
  17 prefetch insensitive benchmarks
CMP: Randomly chosen multiprogrammed workloads from 55 benchmarks:
  4-core CMP: 32 workloads
  8-core CMP: 21 workloads
22
Performance of PADC
[Figure: Three bar charts (single-core, 4-core CMP, 8-core CMP). Y-axis: average performance normalized to demand-first. Series: No-pref, Demand-first, Demand-pref-equal, PADC. Annotated values, in panel order: 4.3%, 8.2%, 9.9%.]
23
Bus Traffic of PADC
[Figure: Three bar charts (single-core, 4-core CMP, 8-core CMP). Y-axis: average bus traffic in million cache lines. Series: No-pref, Demand-first, Demand-pref-equal, PADC. Annotated values, in panel order: -10.4%, -10.7%, -9.4%.]
24
Performance with Other Prefetchers
[Figure: Three bar charts for the 4-core CMP (stride, GHB, and Markov prefetchers). Y-axis: average performance normalized to no prefetching. Series: No-pref, Demand-first, PADC. Annotated values, in panel order: 6.0%, 6.6%, 2.2%.]
25
Bus Traffic with Other Prefetchers
[Figure: Three bar charts for the 4-core CMP (stride, GHB, and Markov prefetchers). Y-axis: average bus traffic in million cache lines. Series: No-pref, Demand-first, PADC. Annotated values, in panel order: -5.7%, -6.8%, -10.3%.]
26
Outline
Motivation
Mechanism
Experimental Evaluation
Conclusion (current section)
27
Conclusions
Prefetch-Aware DRAM Controllers (PADC)
  Adaptive Prefetch Scheduling
    Increase DRAM throughput by exploiting row-buffer locality when prefetches are useful
    Delay service of prefetches when they are useless
  Adaptive Prefetch Dropping
    With APS, remove useless prefetches effectively while keeping the benefits of useful prefetches
Improve performance and bandwidth efficiency for both single-core and CMP systems
Low cost and easily implementable
28
Questions?
29
Performance Detail
Single-core:
  38 prefetch-sensitive: 6.2%
    Prefetch-friendly: 29 benchmarks
    Prefetch-unfriendly: 9 benchmarks
    17 out of 38 are memory intensive (MPKI > 10): 11.8%
  17 prefetch-insensitive
30
Two Channel Memory Performance
[Figure: Two bar charts (4-core CMP, 8-core CMP). Y-axis: average performance normalized to demand-first. Series: 1ch-demand-first, No-pref, Demand-first, Demand-pref-equal, PADC. Annotated values: 5.9% and 5.5%; 31% and 16%.]
31
Two Channel Memory Bus Traffic
[Figure: Two bar charts (4-core CMP, 8-core CMP). Y-axis: average bus traffic in million cache lines. Series: No-pref, Demand-first, Demand-pref-equal, PADC. Annotated values, in panel order: -12.9%, -13.2%.]
32
Comparison with Feedback Directed Prefetching
[Figure: Two bar charts for the 4-core CMP. Performance panel: average normalized to demand-first. Bus traffic panel: average in million cache lines. Series: Demand-first, fdp-demand-first, apd-demand-first, fdp-demand-pref-equal, fdp-aps, PADC (aps-apd). Annotated value: 6.4%.]
33
Performance on Single-Core
[Figure: Bar chart. Y-axis: IPC normalized to demand-first (0 to 1.8). Benchmarks: galgel, ammp, omnetpp, art, milc, swim, libquantum, bwaves, leslie3d, soplex, plus gmean. Series: No-pref, Demand-first, Demand-pref-equal, APS-only, APS-APD (PADC).]
34
Prefetch Friendly Application
libquantum
[Figure: Two bar charts. Bus traffic: million cache lines (0 to 3), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (0 to 1.8).]
35
Prefetch Unfriendly Application
art
[Figure: Two bar charts. Bus traffic: million cache lines (0 to 35), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (0 to 1.2).]
36
Average Performance on Single-Core
All 55 SPEC 2000/2006 CPU benchmarks
[Figure: Two bar charts. Bus traffic: million cache lines (0 to 3.5). Performance: IPC normalized to demand-first (0 to 1.2).]
37
System Performance on 4-Core CMP
32 randomly chosen 4-core workloads
[Figure: Two bar charts. System performance: WS and HS metrics (0 to 3.5). Average bus traffic: million cache lines (0 to 20). Series: No-pref, Demand-first, Demand-pref-equal, APS-only, APS-APD (PADC).]
38
System Performance on 8-core CMP
21 randomly chosen 8-core workloads
[Figure: Two bar charts. System performance: WS and HS metrics (0 to 5). Average bus traffic: million cache lines (0 to 20). Series: No-pref, Demand-first, Demand-pref-equal, APS-only, APS-APD (PADC).]
39
Prefetch Friendly Application
leslie3d
[Figure: Two bar charts. Bus traffic: million cache lines (0 to 6), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (0 to 1.4).]
40
Prefetch Unfriendly Application
ammp
[Figure: Two bar charts. Bus traffic: million cache lines (0 to 0.9), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (0 to 1.4).]
41
Performance on 4-Core
omnetpp, libquantum, galgel, and GemsFDTD on 4-core CMP
[Figure: Two bar charts. Individual speedup: speedup to single application run IPC (0 to 0.9) for omnetpp, libquantum, galgel, and GemsFDTD. System performance: WS and HS metrics (0 to 2.5). Series: No-pref, Demand-first, Demand-pref-equal, APS-only, APS-APD (PADC).]
42
Performance on 4-Core
omnetpp, libquantum, galgel, and GemsFDTD on 4-core CMP
[Figure: Two bar charts. Individual speedup: speedup to single application run IPC (0 to 0.9). Bus traffic: million cache lines (0 to 7) for omnetpp, libquantum, galgel, and GemsFDTD, split into Useless, Useful, and Demand. Series: No-pref, Demand-first, Demand-pref-equal, APS-only, APS-APD (PADC).]
43
System Performance on 4-Core
omnetpp, libquantum, galgel, and GemsFDTD
[Figure: Two bar charts. System performance: WS and HS metrics (0 to 2.5). Bus traffic: million cache lines (0 to 18), split into Useless, Useful, and Demand. Series: No-pref, Demand-first, Demand-pref-equal, APS-only, APS-APD (PADC).]