Prefetch-Aware DRAM Controllers

1

Prefetch-Aware DRAM Controllers

Chang Joo Lee
Onur Mutlu*
Veynu Narasiman
Yale N. Patt

Electrical and Computer Engineering
The University of Texas at Austin

*Microsoft Research and Carnegie Mellon University

2

Outline

Motivation
Mechanism
Experimental Evaluation
Conclusion

3

Modern DRAM Systems

Rows and columns of DRAM cells
A row buffer in each bank
Non-uniform access latency:
  Row-hit: Data is in the row buffer
  Row-conflict: Data is not in the row buffer
    Needs to access the DRAM cells
  Row-hit latency << Row-conflict latency
Prioritize row-hit accesses to increase DRAM throughput [Rixner et al., ISCA 2000]

[Figure: A DRAM bank with its row buffer and data bus. With Row A in the row buffer, a processor access to Row A is a row-hit; a processor access to Row B is a row-conflict.]
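The row-hit-first idea can be sketched as a tiny scheduler. This is only an illustration under stated assumptions, not the controller from the talk: the `Request` type, its fields, and `pick_next` are hypothetical names, and a single bank is modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    row: str      # DRAM row the request targets (hypothetical field)
    arrival: int  # arrival time; a smaller value means an older request


def pick_next(queue: List[Request], open_row: Optional[str]) -> Request:
    """Pick a row-hit request if one exists, otherwise the oldest request."""
    # False sorts before True, so requests to the open row come first;
    # ties are broken by age (first come, first served).
    return min(queue, key=lambda r: (r.row != open_row, r.arrival))


queue = [Request("B", 0), Request("A", 1), Request("A", 2)]
chosen = pick_next(queue, open_row="A")      # oldest request to Row A
fallback = pick_next(queue, open_row=None)   # no open row: oldest overall
```

With Row A open, the scheduler skips the older Row B request in favor of the oldest Row A request, which is the reordering that raises DRAM throughput.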

4

Problems of Prefetch Handling

How to schedule prefetches vs. demands?
  Demand-first: Always prioritizes demands over prefetch requests
  Demand-prefetch-equal: Always treats them the same

Neither takes into account both:
  1. Non-uniform access latency of DRAM systems
  2. Usefulness of prefetches

Neither of these performs best

5

When Prefetches are Useful

[Figure: The DRAM controller holds three requests: Pref Row A: X, Dem Row B: Y, Pref Row A: Z. Row A is in the row buffer. Under demand-first, DRAM services Y (row-conflict, Row B replaces Row A), then X (row-conflict), then Z (row-hit). The processor timeline shows Miss Y, Miss X, Miss Z with execution stalled.]

Processor needs Y, X, and Z
Demand-first: 2 row-conflicts, 1 row-hit

6

When Prefetches are Useful

[Figure: The same three requests (Pref Row A: X, Dem Row B: Y, Pref Row A: Z) with Row A in the row buffer, comparing DRAM and processor timelines for the two policies. Demand-first: the processor sees Miss Y, Miss X, Miss Z. Demand-pref-equal: the processor sees Miss Y, Hit X, Hit Z; the difference is marked as saved cycles.]

Processor needs Y, X, and Z
Demand-first: 2 row-conflicts, 1 row-hit
Demand-pref-equal: 2 row-hits, 1 row-conflict
Demand-pref-equal outperforms demand-first
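The row-hit and row-conflict counts for the two policies can be reproduced with a small model. This is a sketch under assumptions: a single bank, requests listed in arrival order X, Y, Z, and hypothetical names (`Req`, `service_order`). Demand-first is modeled as demands before prefetches, then row-hits, then oldest; demand-pref-equal as row-hits first, then oldest.

```python
from typing import List, Tuple

Req = Tuple[str, str, str]  # (kind, row, name); kind is "pref" or "dem"


def service_order(reqs: List[Req], open_row: str,
                  demand_first: bool) -> List[Tuple[str, str]]:
    """Service all requests; return (name, 'row-hit' or 'row-conflict') pairs."""
    pending = list(reqs)  # list order is arrival order, oldest first
    served = []
    while pending:
        def rank(i: int):
            kind, row, _ = pending[i]
            row_miss = row != open_row
            if demand_first:
                # demands before prefetches, then row-hits, then oldest
                return (kind == "pref", row_miss, i)
            # demand-pref-equal: row-hits first, then oldest
            return (row_miss, i)

        i = min(range(len(pending)), key=rank)
        kind, row, name = pending.pop(i)
        served.append((name, "row-hit" if row == open_row else "row-conflict"))
        open_row = row  # the serviced row now occupies the row buffer
    return served


requests = [("pref", "A", "X"), ("dem", "B", "Y"), ("pref", "A", "Z")]
demand_first_result = service_order(requests, "A", demand_first=True)
equal_result = service_order(requests, "A", demand_first=False)
```

Here `demand_first_result` services Y, X, Z with two row-conflicts and one row-hit, while `equal_result` services X, Z, Y with two row-hits and one row-conflict, matching the counts on the slide.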

7

When Prefetches are Useless

[Figure: The same three requests (Pref Row A: X, Dem Row B: Y, Pref Row A: Z) with Row A in the row buffer. Demand-first services Y, X, Z; demand-pref-equal services X, Z, Y. The processor's Miss Y completes earlier under demand-first; the difference is marked as saved cycles.]

Processor needs ONLY Y
Demand-first outperforms demand-pref-equal

8

Demand-first vs. Demand-pref-equal policy

Stream prefetcher enabled

[Figure: Bar chart of IPC normalized to no prefetching (y-axis 0 to 3) for galgel, ammp, art, milc, swim, libquantum, bwaves, and leslie3d under Demand-first and Demand-pref-equal. One group of benchmarks is annotated "Demand-first is better", the other "Demand-pref-equal is better".]

Goal 1: Adaptively schedule prefetches based on prefetch usefulness
Goal 2: Eliminate useless prefetches

Useless prefetches:
  Off-chip bandwidth
  Queue resources
  Cache pollution

9

Goals

1. Maximize the benefits of prefetching:
   Increase DRAM throughput by adaptively scheduling requests based on prefetch usefulness
   → increase timeliness of useful prefetches

2. Minimize the harm of prefetching:
   Adaptively delay the service of useless prefetches and remove useless prefetches
   → increase efficiency of resource utilization

Achieve higher performance and efficiency

10

Outline

Motivation
Mechanism (current section)
Experimental Evaluation
Conclusion

11

Prefetch-Aware DRAM Controllers (PADC)

Adaptive Prefetch Scheduling (APS): Prioritizes prefetch and demand requests based on prefetch accuracy estimation

Adaptive Prefetch Dropping (APD): Cancels likely-useless prefetches from memory request buffer based on prefetch accuracy

[Figure: PADC block diagram. Prefetch accuracy from each core feeds APS and APD. APS updates the request priority in the memory request buffer; APD reads request info and drops requests from the buffer. Requests leave the buffer to DRAM.]

12

Prefetch Accuracy Estimation

Prefetch accuracy = #Prefetches used / #Prefetches sent

Hardware support:
  Prefetch bit (per L2 cache line, MSHR entry): Indicates whether it is a prefetch or demand
  Prefetch sent counter (per core)
  Prefetch used counter (per core)
  Prefetch accuracy register (per core)

Estimated every 100K cycles
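The per-core estimation can be sketched in software as two counters and a register sampled once per interval. The class and method names are hypothetical, and two details are assumptions the slide does not state: counters are reset at the end of each interval, and accuracy is reported as 0.0 when no prefetches were sent.

```python
class PrefetchAccuracyEstimator:
    """Software model of one core's prefetch sent/used counters."""

    def __init__(self, interval_cycles: int = 100_000):
        self.interval_cycles = interval_cycles  # "every 100K cycles"
        self.sent = 0        # prefetch sent counter
        self.used = 0        # prefetch used counter
        self.accuracy = 0.0  # prefetch accuracy register

    def on_prefetch_sent(self) -> None:
        self.sent += 1

    def on_prefetch_used(self) -> None:
        # Called when a demand hits a line or MSHR entry whose prefetch bit is set.
        self.used += 1

    def tick(self, cycle: int) -> None:
        """Update the accuracy register at each interval boundary."""
        if cycle > 0 and cycle % self.interval_cycles == 0:
            # Assumption: 0.0 when nothing was sent; counters reset per interval.
            self.accuracy = self.used / self.sent if self.sent else 0.0
            self.sent = 0
            self.used = 0


est = PrefetchAccuracyEstimator()
for _ in range(4):
    est.on_prefetch_sent()
for _ in range(3):
    est.on_prefetch_used()
est.tick(100_000)  # interval boundary: accuracy becomes 3 / 4
```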

13

Adaptive Prefetch Scheduling (APS)

1. Adaptively change the priority of prefetch requests
     Low prefetch accuracy → prioritize demands from the core
     High prefetch accuracy → treat demands and prefetches equally

2. In a CMP system: prioritize demand requests from a core that has many useless prefetches
     To avoid starving demand requests from a core with low prefetch accuracy → improves system performance

[Figure: PADC block diagram showing APS (updates request priority), APD (reads request info, drops requests), and the path to DRAM.]

14


1. Critical requests
     All demand requests
     Prefetch requests from cores whose prefetch accuracy ≥ promotion threshold

2. Urgent requests
     Demand requests from cores whose prefetch accuracy < promotion threshold

15


Prioritization order:
  1. Critical request (C)
  2. Row-hit request (RH)
  3. Urgent request (U)
  4. Oldest request (FCFS)

Each memory request buffer entry: priority fields
  C | RH | U | FCFS
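The C, RH, U, FCFS order maps naturally onto a sort key compared field by field. The sketch below is illustrative: `Request`, `priority_key`, and the per-core `accuracy` dictionary are hypothetical names, and the 0.85 default reuses the 85% promotion threshold from the simulation configuration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Request:
    is_prefetch: bool
    core: int
    row_hit: bool
    arrival: int  # smaller means older


def priority_key(req: Request, accuracy: Dict[int, float],
                 promotion_threshold: float = 0.85):
    """Smaller key means higher priority: C, then RH, then U, then FCFS."""
    accurate = accuracy[req.core] >= promotion_threshold
    critical = (not req.is_prefetch) or accurate    # demands, accurate prefetches
    urgent = (not req.is_prefetch) and not accurate  # demands from inaccurate cores
    return (not critical, not req.row_hit, not urgent, req.arrival)


accuracy = {0: 0.90, 1: 0.20}  # core 0 prefetches accurately, core 1 does not
queue = [
    Request(is_prefetch=True, core=1, row_hit=True, arrival=0),
    Request(is_prefetch=True, core=0, row_hit=True, arrival=1),
    Request(is_prefetch=False, core=0, row_hit=False, arrival=2),
    Request(is_prefetch=False, core=1, row_hit=False, arrival=3),
]
order = [r.arrival for r in sorted(queue, key=lambda r: priority_key(r, accuracy))]
```

In this example the accurate core's row-hit prefetch goes first, the demand from the inaccurate core (urgent) beats the older demand from the accurate core, and the inaccurate core's prefetch is served last despite being the oldest row-hit.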

16

Adaptive Prefetch Dropping (APD)

Proactively drops old prefetches based on prefetch accuracy estimation
  Old requests are likely useless
    APS prioritizes demand requests when prefetch accuracy is low
    A prefetch that is hit by a demand is promoted to a demand

Dropping old, useless prefetches saves resources (bandwidth, queues, caches)
  Saved resources can be used by useful requests

[Figure: PADC block diagram showing APS (updates request priority), APD (reads request info, drops requests), and the path to DRAM.]

17

Adaptive Prefetch Dropping (APD)

Each memory request buffer entry: drop information
  P | ID | AGE
  Prefetch bit (P)
  Core ID field (ID)
  Age field (AGE)

Drop prefetch requests whose AGE > Drop threshold

Drop threshold is dynamically determined based on prefetch accuracy estimation
  Lower accuracy → Lower threshold
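The drop rule can be sketched as a filter over the memory request buffer. `Entry` and `drop_old_prefetches` are hypothetical names, and the per-core thresholds passed in are made-up values for illustration; AGE and the threshold are assumed to use the same unit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    p: bool       # prefetch bit (P): True for a prefetch request
    core_id: int  # core ID field (ID)
    age: int      # age field (AGE)


def drop_old_prefetches(buffer: List[Entry],
                        drop_threshold: Dict[int, int]) -> List[Entry]:
    """Keep every demand, and every prefetch whose AGE is within its core's threshold."""
    return [e for e in buffer
            if not (e.p and e.age > drop_threshold[e.core_id])]


buffer = [
    Entry(p=True, core_id=0, age=500),    # old prefetch, low-accuracy core: dropped
    Entry(p=False, core_id=0, age=900),   # demands are never dropped
    Entry(p=True, core_id=1, age=500),    # same age, higher threshold: kept
]
kept = drop_old_prefetches(buffer, drop_threshold={0: 100, 1: 50_000})
```

The same age leads to different outcomes for the two cores because the core with lower accuracy gets the lower threshold.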

18

Hardware Cost for 4-core CMP

                               Cost (bits)
Prefetch Accuracy Estimation   33,056
APS                            128
APD                            1,536
Total                          34,720

Total storage: 34,720 bits (~4.25KB) are needed
  ~4KB are prefetch bits in each cache line
  If prefetch bits are already implemented: ~228B

Logic is not on the critical path
  Scheduling and dropping decisions are made every DRAM bus cycle
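The storage figures can be cross-checked with a few lines of arithmetic. Two inputs are taken from the simulation configuration (512KB L2 per core, 128 L2 MSHRs for the 4-core system); the 64-byte cache line size is an assumption not stated in the slides. Counting the MSHR prefetch bits as already implemented is one reading of the last bullet, and it reproduces the ~228B figure.

```python
CORES = 4
L2_BYTES_PER_CORE = 512 * 1024  # 512KB L2 per core
LINE_BYTES = 64                 # assumption: 64-byte cache lines
MSHR_ENTRIES = 128              # L2 MSHRs for the 4-core configuration

# Prefetch accuracy estimation + APS + APD, in bits.
total_bits = 33_056 + 128 + 1_536
total_bytes = total_bits // 8

# One prefetch bit per L2 cache line across all cores.
cache_prefetch_bits = CORES * L2_BYTES_PER_CORE // LINE_BYTES
cache_prefetch_bytes = cache_prefetch_bits // 8

# Storage left if cache-line and MSHR prefetch bits already exist.
remaining_bytes = (total_bits - cache_prefetch_bits - MSHR_ENTRIES) // 8
```

Under these assumptions the total is 34,720 bits (4,340 bytes), the cache-line prefetch bits account for 4,096 bytes, and 228 bytes remain.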

19

Outline

Motivation
Mechanism
Experimental Evaluation (current section)
Conclusion

20

Simulation Methodology

x86 cycle accurate simulator
Baseline processor configuration
  Per core
    4-wide issue, out-of-order, 256-entry ROB
    512KB, 8-way unified L2 cache (1MB for single core processor)
    Stream prefetcher (Lookahead, prefetch degree: 4, prefetch distance: 64)
  Shared
    On-chip, demand-first FR-FCFS memory controller
    64, 128, 256 L2 MSHRs, memory request buffer for 1-, 4-, 8-core
    DDR3 1333, 15-15-15ns, 4KB row buffer
PADC configuration
  Promotion threshold: 85%
  Drop threshold:

    Prefetch accuracy (%)     0~10   10~30   30~70    70~100
    Threshold (core cycles)   100    1,500   50,000   100,000
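The drop threshold table can be read as a simple lookup from estimated accuracy to a threshold in core cycles. The function name is hypothetical, and the handling of the boundary values (10, 30, 70) is an assumption, since the table does not say which range they belong to.

```python
def drop_threshold_cycles(accuracy_pct: float) -> int:
    """Map prefetch accuracy (0 to 100) to the drop threshold in core cycles."""
    # Assumption: each range includes its lower bound and excludes its upper
    # bound, except the last range, which includes 100.
    if accuracy_pct < 10:
        return 100
    if accuracy_pct < 30:
        return 1_500
    if accuracy_pct < 70:
        return 50_000
    return 100_000


thresholds = [drop_threshold_cycles(a) for a in (5, 20, 50, 95)]
```

A core whose prefetches are almost never used has its prefetches dropped after about 100 cycles in the buffer, while an accurate core's prefetches may wait up to 100,000 cycles.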

21

Workloads for Evaluation

Single-core processor: All 55 SPEC 2000/2006 benchmarks
  Single-threaded
  38 prefetch sensitive benchmarks
  17 prefetch insensitive benchmarks

CMP: Randomly chosen multiprogrammed workloads from 55 benchmarks:
  4-core CMP: 32 workloads
  8-core CMP: 21 workloads

22

Performance of PADC

[Figure: Three bar charts (Single-core, 4-core CMP, 8-core CMP) of average performance normalized to demand-first (y-axis 0 to 1.2) for No-pref, Demand-first, Demand-pref-equal, and PADC. Annotations on the three panels, in order: 4.3%, 8.2%, 9.9%.]

23

Bus Traffic of PADC

[Figure: Three bar charts (Single-core, 4-core CMP, 8-core CMP) of average bus traffic in million cache lines for No-pref, Demand-first, Demand-pref-equal, and PADC. Annotations on the three panels, in order: -10.4%, -10.7%, -9.4%.]

24

Performance with Other Prefetchers

4-core CMP

[Figure: Three bar charts (Stride, GHB, Markov) of average performance normalized to no prefetching (y-axis 0 to 1.2) for No-pref, Demand-first, and PADC. Annotations on the three panels, in order: 6.0%, 6.6%, 2.2%.]

25

Bus Traffic with Other Prefetchers

4-core CMP

[Figure: Three bar charts (Stride, GHB, Markov) of average bus traffic in million cache lines (y-axis 0 to 12) for No-pref, Demand-first, and PADC. Annotations on the three panels, in order: -5.7%, -6.8%, -10.3%.]

26

Outline

Motivation
Mechanism
Experimental Evaluation
Conclusion (current section)

27

Conclusions

Prefetch-Aware DRAM Controllers (PADC)
  Adaptive Prefetch Scheduling
    Increase DRAM throughput by exploiting row-buffer locality when prefetches are useful
    Delay service of prefetches when they are useless
  Adaptive Prefetch Dropping
    With APS, remove useless prefetches effectively while keeping the benefits of useful prefetches

Improve performance and bandwidth efficiency for both single-core and CMP systems

Low cost and easily implementable

28

Questions?

29

Performance Detail

Single-core:
  38 prefetch-sensitive: 6.2%
    Prefetch-friendly: 29 benchmarks
    Prefetch-unfriendly: 9 benchmarks
    17 out of 38 are memory intensive (MPKI > 10): 11.8%
  17 prefetch-insensitive

30

Two Channel Memory Performance

[Figure: Two bar charts (4-core CMP, 8-core CMP) of average performance normalized to demand-first (y-axis 0 to 1.2) for 1ch-demand-first, No-pref, Demand-first, Demand-pref-equal, and PADC. Annotations: 5.9%, 5.5%, 31%, 16%.]

31

Two Channel Memory Bus Traffic

[Figure: Two bar charts (4-core CMP, 8-core CMP) of average bus traffic in million cache lines for No-pref, Demand-first, Demand-pref-equal, and PADC. Annotations on the two panels, in order: -12.9%, -13.2%.]

32

Comparison with Feedback Directed Prefetching

4-core CMP

[Figure: Two bar charts for Demand-first, fdp-demand-first, apd-demand-first, fdp-demand-pref-equal, fdp-aps, and PADC (aps-apd). Performance: average normalized to demand-first (y-axis 0 to 1.2), annotated 6.4%. Bus traffic: average in million cache lines (y-axis 0 to 12).]

33

Performance on Single-Core

[Figure: Bar chart of IPC normalized to demand-first (y-axis 0 to 1.8) for galgel, ammp, omnetpp, art, milc, swim, libquantum, bwaves, leslie3d, soplex, and gmean under No-pref, Demand-first, Demand-pref-equal, APS-only, and APS-APD (PADC).]

34

Prefetch Friendly Application

libquantum

[Figure: Two bar charts. Bus traffic: million cache lines (y-axis 0 to 3), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (y-axis 0 to 1.8).]

35

Prefetch Unfriendly Application

art

[Figure: Two bar charts. Bus traffic: million cache lines (y-axis 0 to 35), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (y-axis 0 to 1.2).]

36

Average Performance on Single-Core

All 55 SPEC 2000/2006 CPU benchmarks

[Figure: Two bar charts. Bus traffic: million cache lines (y-axis 0 to 3.5). Performance: IPC normalized to demand-first (y-axis 0 to 1.2).]

37

System Performance on 4-Core CMP

32 randomly chosen 4-core workloads

[Figure: Two bar charts for No-pref, Demand-first, Demand-pref-equal, APS-only, and APS-APD (PADC). System performance: WS and HS metrics (y-axis 0 to 3.5). Average bus traffic: million cache lines (y-axis 0 to 20).]


38

System Performance on 8-core CMP

21 randomly chosen 8-core workloads

[Figure: Two bar charts for No-pref, Demand-first, Demand-pref-equal, APS-only, and APS-APD (PADC). System performance: WS and HS metrics (y-axis 0 to 5). Average bus traffic: million cache lines (y-axis 0 to 20).]

39

Prefetch Friendly Application

leslie3d

[Figure: Two bar charts. Bus traffic: million cache lines (y-axis 0 to 6), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (y-axis 0 to 1.4).]

40

Prefetch Unfriendly Application

ammp

[Figure: Two bar charts. Bus traffic: million cache lines (y-axis 0 to 0.9), split into Useless, Useful, and Demand. Performance: IPC normalized to demand-first (y-axis 0 to 1.4).]

41

Performance on 4-Core

omnetpp, libquantum, galgel, and GemsFDTD on 4-core CMP

[Figure: Two bar charts for No-pref, Demand-first, Demand-pref-equal, APS-only, and APS-APD (PADC). Individual speedup: speedup to single application run IPC (y-axis 0 to 0.9) for omnetpp, libquantum, galgel, and GemsFDTD. System performance: WS and HS metrics (y-axis 0 to 2.5).]


42

Performance on 4-Core

omnetpp, libquantum, galgel, and GemsFDTD on 4-core CMP

[Figure: Two bar charts. Individual speedup: speedup to single application run IPC (y-axis 0 to 0.9). Bus traffic: million cache lines (y-axis 0 to 7) for omnetpp, libquantum, galgel, and GemsFDTD, each under No-pref, Demand-first, Demand-pref-equal, APS-only, and APS-APD (PADC), split into Useless, Useful, and Demand.]

43

System Performance on 4-Core

omnetpp, libquantum, galgel, and GemsFDTD

[Figure: Two bar charts for No-pref, Demand-first, Demand-pref-equal, APS-only, and APS-APD (PADC). System performance: WS and HS metrics (y-axis 0 to 2.5). Bus traffic: million cache lines (y-axis 0 to 18), split into Useless, Useful, and Demand.]
