
Improving Cache Management Policies Using Dynamic Reuse Distances


Page 1: Improving Cache Management Policies Using Dynamic Reuse Distances

IMPROVING CACHE MANAGEMENT POLICIES USING DYNAMIC REUSE DISTANCES

Nam Duong¹, Dali Zhao¹, Taesu Kim¹, Rosario Cammarota¹, Mateo Valero², Alexander V. Veidenbaum¹

¹University of California, Irvine
²Universitat Politecnica de Catalunya and Barcelona Supercomputing Center

Page 2: Improving Cache Management Policies Using Dynamic Reuse Distances

CACHE MANAGEMENT

2

[Figure: taxonomy of cache management policies]
Single-core: replacement (LRU, NRU, EELRU, DIP, RRIP, ..., PDP), bypass (SDP, ..., PDP), prefetch
Shared-cache: partitioning (UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, ..., PDP)

Cache management has been a hot research topic.

Page 3: Improving Cache Management Policies Using Dynamic Reuse Distances

OVERVIEW

Proposed new cache replacement and partitioning algorithms with a better balance between reuse and cache pollution

Introduced a new concept, the Protecting Distance (PD), which is shown to achieve such a balance

Developed single- and multi-core hit rate models as a function of the PD, cache configuration, and program behavior; the models are used to dynamically compute the best PD

Showed that PD-based cache management policies improve performance for both single- and multi-core systems

3

Page 4: Improving Cache Management Policies Using Dynamic Reuse Distances

OUTLINE

1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

4

Page 5: Improving Cache Management Policies Using Dynamic Reuse Distances

DEFINITIONS

The (line) reuse distance (RD): the number of accesses to the same cache set between two accesses to the same line. This metric is directly related to hit rate.

The reuse distance distribution (RDD): a distribution of the observed reuse distances; a program signature for a given cache configuration.

[Figure: RDDs of representative benchmarks 403.gcc, 436.cactusADM, and 464.h264ref; X-axis: the RD (< 256)]

5
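The reuse distance and RDD defined above can be illustrated in software. Below is a minimal sketch (not the paper's hardware sampler) that builds an RDD from a trace of (set index, line tag) accesses; the function name and trace format are assumptions for illustration.

```python
from collections import defaultdict

def build_rdd(accesses, max_rd=256):
    """Build a reuse distance distribution (RDD) from an access trace.

    accesses: iterable of (set_index, tag) pairs for a cache's access stream.
    Returns (rdd, n_total): rdd[i] = number of accesses with reuse distance i
    (distances >= max_rd are lumped into the last bin), n_total = total accesses.
    """
    set_access_count = defaultdict(int)  # accesses seen so far per set
    last_seen = {}                       # (set, tag) -> set access count at last access
    rdd = [0] * (max_rd + 1)
    n_total = 0

    for set_idx, tag in accesses:
        n_total += 1
        key = (set_idx, tag)
        if key in last_seen:
            # RD = number of accesses to this set since this line was last touched
            rd = set_access_count[set_idx] - last_seen[key]
            rdd[min(rd, max_rd)] += 1
        set_access_count[set_idx] += 1
        last_seen[key] = set_access_count[set_idx]
    return rdd, n_total
```

The PDP hardware described later approximates this with a small sampler over a few sets rather than tracking every line.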

Page 6: Improving Cache Management Policies Using Dynamic Reuse Distances

FUTURE BEHAVIOR PREDICTION

Cache management policies use past reference behavior to predict future accesses; prediction accuracy is critical.

Prediction in some of the prior policies:
LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity).
Early eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict.
RRIP: predicts whether a line will be reused in the near, long, or distant future.

6

Page 7: Improving Cache Management Policies Using Dynamic Reuse Distances

BALANCING REUSE AND CACHE POLLUTION

Key to good performance (a high hit rate):
Cache lines must be reused as much as possible before eviction, AND must be evicted soon after the "last" reuse to give space to new lines.

The former can be achieved by using the reuse distance and actively preventing eviction: "protecting" a line from eviction.
The latter can be achieved by evicting a line when it is not reused within this distance.

There is an optimal reuse distance balancing the two; it is called the Protecting Distance (PD).

7

Page 8: Improving Cache Management Policies Using Dynamic Reuse Distances

EXAMPLE: 436.CACTUSADM

A majority of lines are reused within 64 accesses, and there are multiple peaks at different reuse distances.

Reuse is maximized if lines are kept in the cache for 64 accesses: lines may not be reused if evicted before that, while lines kept beyond that are likely to pollute the cache.

Assume that no lines are kept longer than a given RD:

[Figure: the 436.cactusADM RDD, and the reduction in miss rate over LRU when no line is kept longer than RD = 16, 32, 48, 72, 128, or 256, compared with EELRU, DIP, and RRIP; Y-axis: 0% to 60%]

8

Page 9: Improving Cache Management Policies Using Dynamic Reuse Distances

THE PROTECTING DISTANCE (PD)

A distance at which a majority of lines are covered.
A single value for all sets, predicted based on the current RDD.

Questions to answer/solve:
Why does using the PD achieve the balance?
How to dynamically find the PD for an application and a cache configuration?
How to build the PD-based management policies?

9

Page 10: Improving Cache Management Policies Using Dynamic Reuse Distances

OUTLINE

1. The concept of Protecting Distance
2. Single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

10

Page 11: Improving Cache Management Policies Using Dynamic Reuse Distances

THE SINGLE-CORE PDP

A cache tag contains the line's remaining PD (RPD); a line can be evicted when its RPD = 0.

The RPD of an inserted or promoted line is set to the predicted PD; the RPDs of the other lines in the set are decremented.

Example: a 4-way cache, the predicted PD is 7, and a line is promoted on a hit. A set with its RPDs before and after the hit access:

[Figure: the 4-way set's RPDs before and after the hit (rows shown: 0 6 5 2 and 1 4 6 3); the reused line and the inserted (unused) line are marked]

11

Page 12: Improving Cache Management Policies Using Dynamic Reuse Distances

THE SINGLE-CORE PDP (CONT.)

Selecting a victim on a miss:
A line with an RPD = 0 can be replaced.

Two cases when all RPDs > 0 (no unprotected lines):
Caches without bypass (inclusive): unused lines are less likely to be reused than reused lines, so replace an unused line with the highest RPD first; if there is no unused line, replace the line with the highest RPD.
Caches with bypass (non-inclusive): bypass the new line.

12

[Figure: example 4-way sets with their RPDs illustrating the victim-selection cases; the reused and inserted (unused) lines are marked]
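Putting the insertion, promotion, and victim-selection rules from the last two slides together, here is a minimal per-set sketch. The class and method names are illustrative, and the exact update ordering (decrement before promotion) is a simplification of the slides' description, not the paper's hardware.

```python
class Line:
    def __init__(self, tag, pd):
        self.tag = tag
        self.rpd = pd        # remaining protecting distance
        self.reused = False  # set on the first hit

class PDPSet:
    """One cache set managed with PD-based replacement/bypass."""

    def __init__(self, ways, allow_bypass):
        self.ways = ways
        self.allow_bypass = allow_bypass  # True for a non-inclusive cache
        self.lines = []

    def access(self, tag, pd):
        """Handle one access; returns True on a hit, False on a miss."""
        # Every set access ages the resident lines.
        for line in self.lines:
            line.rpd = max(0, line.rpd - 1)

        for line in self.lines:
            if line.tag == tag:           # hit: promote (re-protect) the line
                line.rpd = pd
                line.reused = True
                return True

        # Miss: fill an empty way if one exists.
        if len(self.lines) < self.ways:
            self.lines.append(Line(tag, pd))
            return False

        unprotected = [l for l in self.lines if l.rpd == 0]
        if unprotected:
            victim = unprotected[0]
        elif self.allow_bypass:
            return False                  # all lines protected: bypass the new line
        else:
            # Inclusive cache: prefer an unused (never reused) line with the
            # highest RPD, otherwise the line with the highest RPD.
            unused = [l for l in self.lines if not l.reused]
            pool = unused if unused else self.lines
            victim = max(pool, key=lambda l: l.rpd)

        self.lines.remove(victim)
        self.lines.append(Line(tag, pd))
        return False
```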

Page 13: Improving Cache Management Policies Using Dynamic Reuse Distances

EVALUATION OF THE STATIC PDP

Static PDP: use the best static PD (PD < 256) for each benchmark.
SPDP-NB: static PDP with replacement only
SPDP-B: static PDP with replacement and bypass

Performance: in general, DRRIP < SPDP-NB < SPDP-B; 436.cactusADM gets a 10% additional miss reduction.
The two static PDP policies have similar performance; for 483.xalancbmk, 3 different execution windows show different behavior under SPDP-B.

13

[Figure: miss reduction over DRRIP of SPDP-NB and SPDP-B for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, 483.xalancbmk.1, 483.xalancbmk.2, and 483.xalancbmk.3; Y-axis: -5% to 25%]

Page 14: Improving Cache Management Policies Using Dynamic Reuse Distances

436.CACTUSADM: EXPLAINING THE PERFORMANCE DIFFERENCE

How do the evicted lines occupy the cache?

DRRIP: early-evicted lines account for 75% of accesses but occupy only 4% of the cache; late-evicted lines account for 2% of accesses but occupy 8% of the cache → pollution.
SPDP-NB: early- and late-evicted lines account for 42% of accesses but occupy only 4%.
SPDP-B: late-evicted lines account for 1% of accesses and occupy 3% of the cache → cache space is yielded to useful lines.

14

[Figure: access and occupancy breakdown for DRRIP, SPDP-NB, and SPDP-B, split into hits, bypasses, lines evicted before 16 accesses (early), and lines evicted after 16 accesses (late); Y-axis: 0% to 100%]

PDP suffers less cache pollution from long-RD lines than RRIP.

Page 15: Improving Cache Management Policies Using Dynamic Reuse Distances

CASE STUDY: 483.XALANCBMK

15

[Figure: the RDDs of 483.xalancbmk.1, 483.xalancbmk.2, and 483.xalancbmk.3, and the hit rate of SPDP-B for each window; Y-axis: 0% to 80%]

The best PD is different in different windows, and for different programs.

We need a dynamic policy that finds the best PD, and a model to drive the search.

There is a close relationship between the hit rate, the PD, and the RDD.

Page 16: Improving Cache Management Policies Using Dynamic Reuse Distances

A HIT RATE MODEL FOR A NON-INCLUSIVE CACHE

The model estimates the hit rate E as a function of d_p and the RDD:
{N_i}, N_t: the RDD (the N_i counters and the total access count N_t)
d_p: the protecting distance
d_e: experimentally set to W (W: cache associativity)

[Figure: the RDD, the modeled hit rate E, and the measured hit rate as functions of d_p for 403.gcc, 436.cactusADM, and 464.h264ref]

16

The model is used to find the PD that maximizes the hit rate.

E(d_p) = \frac{\mathrm{Hits}(d_p)}{\mathrm{Accesses}(d_p)} \times W = \frac{\sum_{i=1}^{d_p} N_i}{\sum_{i=1}^{d_p} i \, N_i + \left(N_t - \sum_{i=1}^{d_p} N_i\right) d_e} \times W
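To make the model concrete, here is a small sketch that evaluates E(d_p) over a measured RDD (for example, one produced by the build_rdd sketch earlier) and returns the d_p that maximizes it. The names and the scan over every candidate d_p are illustrative; the real hardware would evaluate E only at the granularity of its RD counters.

```python
def best_pd(rdd, n_total, ways, max_pd=256):
    """Pick the protecting distance d_p that maximizes the modeled hit rate E.

    rdd[i]  : number of sampled accesses with reuse distance i (N_i)
    n_total : total number of sampled accesses (N_t)
    ways    : cache associativity W; d_e is set to W as on the slides
    """
    d_e = ways
    best_dp, best_e = 1, 0.0
    protected_hits = 0   # running sum of N_i for i <= d_p
    protected_time = 0   # running sum of i * N_i for i <= d_p

    for dp in range(1, min(max_pd, len(rdd) - 1) + 1):
        protected_hits += rdd[dp]
        protected_time += dp * rdd[dp]
        # Lines not reused within d_p occupy the cache for d_e more accesses.
        occupancy = protected_time + (n_total - protected_hits) * d_e
        if occupancy == 0:
            continue
        e = ways * protected_hits / occupancy
        if e > best_e:
            best_dp, best_e = dp, e
    return best_dp, best_e
```

For a 16-way LLC, best_pd(rdd, n_total, 16) would return the PD to use for the next interval.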

Page 17: Improving Cache Management Policies Using Dynamic Reuse Distances

PDP CACHE ORGANIZATION

The RD sampler tracks accesses to several cache sets (in the L2 miss/writeback stream, so the sampling rate can be reduced) and measures the reuse distance of each new access.
The RD counter array collects the number of accesses at each RD = i (the N_i counters) and the total N_t; to reduce overhead, each counter covers a range of RDs.
The PD compute logic finds the PD that maximizes E.

The computed PD is used in the next interval (0.5M L3 accesses).

Reasonable hardware overhead: 2 or 3 bits per tag to store the RPD.

17

[Figure: PDP cache organization. The LLC sits between the higher-level cache and main memory; the access address feeds the RD sampler, the measured RDs update the RD counter array (the RDD), and the PD compute logic produces the PD used by the LLC]
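As a companion to the organization above, here is a small sketch of a range-based RD counter array; the number of counters (16) and the uniform bucket width are assumptions for illustration, not the paper's sizing.

```python
class RDCounterArray:
    """Bucketed RD counters: each counter covers a range of reuse distances."""

    def __init__(self, max_rd=256, num_counters=16):
        self.step = max_rd // num_counters   # width of each RD bucket
        self.counters = [0] * num_counters   # aggregated N_i per bucket
        self.total = 0                       # N_t: total sampled accesses

    def record(self, rd):
        """Record one sampled access with measured reuse distance rd."""
        self.total += 1
        bucket = min(rd // self.step, len(self.counters) - 1)
        self.counters[bucket] += 1

    def read(self):
        """Return (bucketed counters, total); bucket b covers RDs [b*step, (b+1)*step)."""
        return list(self.counters), self.total
```

The PD compute logic can then evaluate E only at bucket boundaries, which keeps the end-of-interval search cheap.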

Page 18: Improving Cache Management Policies Using Dynamic Reuse Distances

PDP VS. EXISTING POLICIES

Management policy | Replacement | Bypass | Reuse | Pollution | Distance measurement | Model
LRU               | Yes         | No     | No    | Yes       | Stack-based          | No
EELRU [1]         | Yes         | No     | No    | Yes       | Stack-based          | Probabilistic
DIP [2]           | Yes         | No     | Yes   | No        | N/A                  | No
RRIP [3]          | Yes         | No     | Yes   | No        | N/A                  | No
SDP [4]           | No          | Yes    | Yes   | No        | N/A                  | No
PDP               | Yes         | Yes    | Yes   | Yes       | Access-based         | Hit rate

(Replacement/Bypass: supported policy (*); Reuse/Pollution: what the policy balances.)

18

[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS’99

[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA’07

[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA’10

[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO’10

(*) The originally proposed EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance. However, lines are not always guaranteed to be protected.

Page 19: Improving Cache Management Policies Using Dynamic Reuse Distances

OUTLINE

1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

19

Page 20: Improving Cache Management Policies Using Dynamic Reuse Distances

PD-BASED SHARED CACHE PARTITIONING

Each thread has its own PD (thread-aware): the counter array is replicated per thread, while the sampler and the compute logic are shared.

A thread's PD determines its cache partition: its lines occupy the cache longer if its PD is large, so the cache is implicitly partitioned according to the needs of each thread through the thread PDs.

The problem is to find a set of thread PDs that together maximize the hit rate.

20

Page 21: Improving Cache Management Policies Using Dynamic Reuse Distances

SHARED-CACHE HIT RATE MODEL

Extending the single-core approach: compute a vector <PD> = (PD_1, ..., PD_T), where T is the number of threads.

An exhaustive search for <PD> is not practical. A heuristic search algorithm finds the combination of threads' RDD peaks that maximizes the hit rate: the single-core model generates the top 3 peaks per thread, and the complexity is O(T^2).

See the paper for more detail.

21

E(\langle PD \rangle) = \frac{\sum_{t=1}^{T} \mathrm{Hits}_t}{\sum_{t=1}^{T} \mathrm{Accesses}_t} \times W
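As an illustration of the idea (not the paper's exact O(T^2) heuristic), the sketch below runs a simple coordinate-descent search over per-thread candidate PDs, such as the top-3 RDD peaks of each thread, scoring each combination with a shared-E estimate built from the single-core terms. All names and the search schedule are assumptions.

```python
def thread_terms(rdd, n_total, dp, d_e):
    """Hits and occupancy for one thread at protecting distance dp,
    following the single-core model."""
    dp = min(dp, len(rdd) - 1)   # clamp to the tracked RD range
    hits = sum(rdd[1:dp + 1])
    occupancy = sum(i * rdd[i] for i in range(1, dp + 1)) + (n_total - hits) * d_e
    return hits, occupancy

def search_thread_pds(thread_rdds, thread_totals, candidate_pds, ways, passes=2):
    """Pick one PD per thread from its candidate list to maximize the shared E."""
    d_e = ways
    pds = [cands[0] for cands in candidate_pds]   # start at each thread's first peak

    def shared_e(pds):
        hits = occ = 0
        for rdd, n_total, dp in zip(thread_rdds, thread_totals, pds):
            h, o = thread_terms(rdd, n_total, dp, d_e)
            hits += h
            occ += o
        return ways * hits / occ if occ else 0.0

    for _ in range(passes):
        for t, cands in enumerate(candidate_pds):
            # Re-optimize thread t's PD while holding the other threads fixed.
            pds[t] = max(cands, key=lambda dp: shared_e(pds[:t] + [dp] + pds[t + 1:]))
    return pds, shared_e(pds)
```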

Page 22: Improving Cache Management Policies Using Dynamic Reuse Distances

OUTLINE

1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

22

Page 23: Improving Cache Management Policies Using Dynamic Reuse Distances

EVALUATION METHODOLOGY

CMP$im simulator, LLC replacement. Target cache: the LLC.

23

Cache          | Parameters
DCache         | 32KB, 8-way, 64B, 2 cycles
ICache         | 32KB, 4-way, 64B, 2 cycles
L2 Cache       | 256KB, 8-way, 64B, 10 cycles
L3 Cache (LLC) | 2MB, 16-way, 64B, 30 cycles
Memory         | 200 cycles

Page 24: Improving Cache Management Policies Using Dynamic Reuse Distances

EVALUATION METHODOLOGY (CONT.)

Benchmarks: SPEC CPU 2006, excluding those that do not stress the LLC.

Single-core: compared to EELRU, SDP, DIP, DRRIP.

Multi-core: 4- and 16-core configurations, 80 workloads each; the workloads are generated by randomly combining benchmarks. Compared to UCP, PIPP, TA-DRRIP.

Our policy: PDP-x, where x is the number of bits per cache line.

24

Page 25: Improving Cache Management Policies Using Dynamic Reuse Distances

SINGLE-CORE PDP

PDP-x, where x is the number of bits per cache line. Each benchmark is executed for 1B instructions.

PDP is best when it can use 3 bits per line, but it is still better than prior work at 2 bits.

25

[Figure: IPC improvement over DIP of SDP, DRRIP, EELRU, PDP-2, PDP-3, PDP-8, and SPDP-B for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, 483.xalancbmk.1-3, and the average; Y-axis: -30% to 30%]

Page 26: Improving Cache Management Policies Using Dynamic Reuse Distances

ADAPTATION TO PROGRAM PHASES

5 benchmarks that demonstrate significant phase changes; each benchmark is run for 5B instructions.

Change of the PD over time (X-axis: 1M LLC accesses):

26

[Figure: PD over time for 403.gcc, 429.mcf, 450.soplex, 482.sphinx3, and 483.xalancbmk; Y-axis: PD from 0 to 200]

Page 27: Improving Cache Management Policies Using Dynamic Reuse Distances

ADAPTATION TO PROGRAM PHASES (CONT.)

IPC improvement over DIP

27

[Figure: IPC improvement over DIP of RRIP, PDP-2, PDP-3, and PDP-8 for 403.gcc, 429.mcf, 450.soplex, 482.sphinx3, and 483.xalancbmk; Y-axis: -5% to 15%]

Page 28: Improving Cache Management Policies Using Dynamic Reuse Distances

PD-BASED CACHE PARTITIONING FOR 16 CORES

Normalized to TA-DRRIP:

28

[Figure: per-workload results of UCP, PIPP, PDP-2, and PDP-3 for the 80 16-core workloads, reported as W, T, and H, normalized to TA-DRRIP (Y-axis: -20% to 40%), plus the averages of W, T, and H for each policy (Y-axis: -10% to 10%)]

Page 29: Improving Cache Management Policies Using Dynamic Reuse Distances

HARDWARE OVERHEAD

Policy | Per-line bits | Overhead (%)
DIP    | 4             | 0.8%
RRIP   | 2             | 0.4%
SDP    | 4             | 1.4%
PDP-2  | 2             | 0.6%
PDP-3  | 3             | 0.8%

29

Page 30: Improving Cache Management Policies Using Dynamic Reuse Distances

OTHER RESULTS

Exploration of PDP cache parameters
Cache bypass fraction
Prefetch-aware PDP
PD-based cache management policy for the 4-core configuration

30

Page 31: Improving Cache Management Policies Using Dynamic Reuse Distances

CONCLUSIONS

Proposed the concept of the Protecting Distance (PD) and showed that it can be used to better balance reuse and cache pollution.

Developed a hit rate model as a function of the PD, program behavior, and cache configuration.

Proposed PD-based management policies for both single- and multi-core systems.

PD-based policies outperform existing policies.

31

Page 32: Improving Cache Management Policies Using Dynamic Reuse Distances

THANK YOU!

32

Page 33: Improving Cache Management Policies Using Dynamic Reuse Distances

BACKUP SLIDES

RDD, E and hit rate of all benchmarks

33

Page 34: Improving Cache Management Policies Using Dynamic Reuse Distances

RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS

34

[Figure: the RDD, modeled E, and measured hit rate for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, and 456.hmmer]

Page 35: Improving Cache Management Policies Using Dynamic Reuse Distances

RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS (CONT.)

35

[Figure: the RDD, modeled E, and measured hit rate for 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, and 482.sphinx3]

Page 36: Improving Cache Management Policies Using Dynamic Reuse Distances

RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS (CONT.)

36

[Figure: the RDD, modeled E, and measured hit rate for 483.xalancbmk.1, 483.xalancbmk.2, and 483.xalancbmk.3]