
Adaptive Insertion Policies for High-Performance Caching


Page 1: Adaptive Insertion Policies for High-Performance Caching

Adaptive Insertion Policies for High-Performance Caching

Moinuddin K. Qureshi    Yale N. Patt
Aamer Jaleel    Simon C. Steely Jr.    Joel Emer

International Symposium on Computer Architecture (ISCA) 2007

Page 2: Adaptive Insertion Policies for High-Performance Caching


Background

L1 misses: short latency, can be hidden

L2 misses: long latency, hurt performance

Important to reduce last-level (L2) cache misses

Fast processor + slow memory → cache hierarchy

[Diagram: Proc → L1 (~2 cycles) → L2 (~10 cycles) → Memory (~300 cycles); an L2 miss goes to memory]

Page 3: Adaptive Insertion Policies for High-Performance Caching


Motivation

L1 for latency, L2 for capacity

Traditionally, L2 is managed like L1 (typically LRU)

L1 filters temporal locality → poor locality at L2

LRU causes thrashing when working set > cache size: most lines remain unused between insertion and eviction

Page 4: Adaptive Insertion Policies for High-Performance Caching


Dead on Arrival (DoA) Lines

DoA Lines: Lines unused between insertion and eviction

For the 1MB 16-way L2, 60% of lines are DoA

Ineffective use of cache space

[Figure: DoA lines (%) per benchmark]

Page 5: Adaptive Insertion Policies for High-Performance Caching

Why DoA Lines?

Streaming data → never reused. L2 caches don’t help.

Working set of application greater than cache size

Soln: if working set > cache size, retain some of the working set

[Figure: misses per 1000 instructions vs. cache size (MB) for art and mcf]

Page 6: Adaptive Insertion Policies for High-Performance Caching


Overview

Problem: LRU replacement inefficient for L2 caches

Goal: A replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads

Proposal: A mechanism that reduces misses by 21% and has total storage overhead < two bytes

Page 7: Adaptive Insertion Policies for High-Performance Caching


Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

Page 8: Adaptive Insertion Policies for High-Performance Caching


Cache Insertion Policy

Simple changes to insertion policy can greatly improve cache performance for memory-intensive workloads

Two components of cache replacement:

1. Victim Selection: Which line to replace for incoming line? (E.g. LRU, Random, FIFO, LFU)

2. Insertion Policy: Where is incoming line placed in replacement list? (E.g. insert incoming line at MRU position)

Page 9: Adaptive Insertion Policies for High-Performance Caching


LRU-Insertion Policy (LIP)

Recency stack (MRU → LRU): a b c d e f g h

Reference to ‘i’ with traditional LRU policy:
i a b c d e f g

Reference to ‘i’ with LIP:
a b c d e f g i

Choose victim. Do NOT promote to MRU.

Lines do not enter non-LRU positions unless reused
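Below is a minimal sketch of one set’s recency stack (a plain C++ model written for this transcript, not the paper’s released source): victim selection is identical under LRU and LIP, and only the insertion position of the incoming line changes.

```cpp
#include <cstdint>
#include <list>

// One cache set modeled as a recency stack: front = MRU, back = LRU.
struct CacheSet {
    std::list<uint64_t> stack;  // line tags, most-recently-used at front
    size_t ways;

    explicit CacheSet(size_t w) : ways(w) {}

    // Returns true on a hit. lipInsert selects the insertion policy:
    // false = traditional LRU insertion (MRU), true = LIP (LRU position).
    bool access(uint64_t tag, bool lipInsert) {
        for (auto it = stack.begin(); it != stack.end(); ++it) {
            if (*it == tag) {           // hit: promote to MRU on reuse
                stack.erase(it);
                stack.push_front(tag);
                return true;
            }
        }
        if (stack.size() == ways)       // miss in a full set:
            stack.pop_back();           // victim is always the LRU line
        if (lipInsert)
            stack.push_back(tag);       // LIP: do NOT promote to MRU
        else
            stack.push_front(tag);      // LRU: insert at MRU
        return false;
    }
};
```

With lipInsert = true, an inserted line is evicted on the very next miss unless it is reused first, so a streaming workload can no longer flush the whole set.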

Page 10: Adaptive Insertion Policies for High-Performance Caching

Bimodal-Insertion Policy (BIP)

LIP does not age older lines

Infrequently insert lines in MRU position:

if (rand() < ε)
    insert at MRU position;
else
    insert at LRU position;

ε = bimodal throttle parameter

For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set
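A sketch of BIP on top of the CacheSet model above; ε = 1/32 matches the value used in the results slide, while the RNG choice and seed are arbitrary.

```cpp
#include <random>

// BIP: with probability eps insert at MRU, otherwise insert at LRU
// (i.e., behave like LIP). Uses the CacheSet sketch from the LIP slide.
bool accessBIP(CacheSet& set, uint64_t tag, double eps = 1.0 / 32.0) {
    static std::mt19937 rng{12345};  // arbitrary fixed seed
    static std::uniform_real_distribution<double> coin(0.0, 1.0);
    bool lipInsert = coin(rng) >= eps;  // rare MRU insertion ages old lines
    return set.access(tag, lipInsert);
}
```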

Page 11: Adaptive Insertion Policies for High-Performance Caching

Circular Reference Model [Smith & Goodman, ISCA ’84]

Reference stream has T blocks and repeats N times. Cache has K blocks (K < T and N >> T).

Hit rate under each policy:

Policy          (a1 a2 a3 … aT)^N    (b1 b2 b3 … bT)^N
LRU             0                    0
OPT             (K−1)/(T−1)          (K−1)/(T−1)
LIP             (K−1)/T              0
BIP (small ε)   ≈ (K−1)/T            ≈ (K−1)/T

(LIP keeps K−1 blocks of the working set resident, hitting on K−1 of every T references; LRU evicts every block before its reuse.)
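The LIP and LRU rows can be checked empirically by replaying a cyclic stream through the CacheSet sketch from the LIP slide; K = 8, T = 12, N = 1000 below are example values, not from the paper.

```cpp
#include <cstdio>

int main() {
    const uint64_t T = 12;      // distinct blocks in the cyclic stream
    const int N = 1000;         // number of passes
    CacheSet lru(8), lip(8);    // K = 8 ways each
    long hitsLRU = 0, hitsLIP = 0, refs = 0;
    for (int n = 0; n < N; ++n)
        for (uint64_t t = 0; t < T; ++t, ++refs) {
            hitsLRU += lru.access(t, /*lipInsert=*/false);
            hitsLIP += lip.access(t, /*lipInsert=*/true);
        }
    // Model predicts: LRU -> 0, LIP -> (K-1)/T = 7/12 ~ 0.583
    std::printf("LRU hit rate %.3f, LIP hit rate %.3f\n",
                (double)hitsLRU / refs, (double)hitsLIP / refs);
}
```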

Page 12: Adaptive Insertion Policies for High-Performance Caching


Results for LIP and BIP

Changes to the insertion policy increase misses for LRU-friendly workloads

[Figure: % reduction in L2 MPKI for LIP and BIP (ε = 1/32) per benchmark]

Page 13: Adaptive Insertion Policies for High-Performance Caching


Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

Page 14: Adaptive Insertion Policies for High-Performance Caching


Dynamic-Insertion Policy (DIP)

Two types of workloads: LRU-friendly or BIP-friendly

DIP can be implemented by:

1. Monitor both policies (LRU and BIP)

2. Choose the best-performing policy

3. Apply the best policy to the cache

Need a cost-effective implementation: “Set Dueling”

Page 15: Adaptive Insertion Policies for High-Performance Caching

DIP via “Set Dueling”

Divide the cache in three:
– Dedicated LRU sets
– Dedicated BIP sets
– Follower sets (follow the winner of LRU vs. BIP)

n-bit saturating counter:
misses to LRU-sets: counter++
misses to BIP-sets: counter--

Counter decides policy for follower sets:
– MSB = 0: use LRU
– MSB = 1: use BIP

[Diagram: misses to LRU-sets and BIP-sets drive the n-bit counter up and down; MSB = 0 → use LRU, MSB = 1 → use BIP. Monitor → choose → apply, using a single counter]
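A sketch of the set-dueling logic, assuming 1024 sets, a 10-bit saturating counter, and modulo sampling to pick 32 dedicated sets per policy; the constants and the sampling scheme are illustrative choices, not the paper’s exact design.

```cpp
#include <cstdint>

constexpr uint32_t kPselBits = 10;                  // assumed counter width
constexpr uint32_t kPselMax  = (1u << kPselBits) - 1;
uint32_t psel = kPselMax / 2;                       // n-bit saturating counter

// Modulo sampling: with 1024 sets this dedicates 32 sets to each policy.
bool isLRUSet(uint32_t set) { return set % 32 == 0; }
bool isBIPSet(uint32_t set) { return set % 32 == 1; }

// Called on every miss: only dedicated-set misses move the counter.
void onMiss(uint32_t set) {
    if (isLRUSet(set) && psel < kPselMax) ++psel;   // LRU is missing
    else if (isBIPSet(set) && psel > 0)   --psel;   // BIP is missing
}

// Insertion policy per set: dedicated sets always follow their own
// policy; follower sets follow the MSB of the counter.
bool useBIP(uint32_t set) {
    if (isLRUSet(set)) return false;
    if (isBIPSet(set)) return true;
    return (psel >> (kPselBits - 1)) & 1;           // MSB = 1 -> use BIP
}
```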

Page 16: Adaptive Insertion Policies for High-Performance Caching


Bounds on Dedicated Sets

How many dedicated sets required for “Set Dueling”?

μLRU, σLRU, μBIP, σBIP = Avg. misses and stdev. for LRU and BIP

P(Best) = probability of selecting best policy

P(Best) = P(Z < r·√n), where:
n = number of dedicated sets
Z = standard Gaussian variable
r = |μLRU − μBIP| / √(σLRU² + σBIP²)

For the majority of workloads r > 0.2 → 32–64 dedicated sets are sufficient
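Plugging in r = 0.2 (the bound quoted above) shows why 32–64 dedicated sets suffice; Φ is the standard normal CDF:

```latex
P(\text{Best}) = \Phi(r\sqrt{n}):\qquad
\Phi(0.2\sqrt{32}) = \Phi(1.13) \approx 0.87, \qquad
\Phi(0.2\sqrt{64}) = \Phi(1.60) \approx 0.95
```

So with 32 to 64 dedicated sets, set dueling picks the better policy with probability roughly 0.87 to 0.95.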

Page 17: Adaptive Insertion Policies for High-Performance Caching


Results for DIP

DIP reduces average MPKI by 21% and requires < two bytes of storage overhead

[Figure: % reduction in L2 MPKI for BIP and DIP (32 dedicated sets) per benchmark]

Page 18: Adaptive Insertion Policies for High-Performance Caching


DIP vs. Other Policies

[Figure: % reduction in average L2 MPKI for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, OPT, and Double (2MB)]

DIP bridges two-thirds of gap between LRU and OPT


Page 19: Adaptive Insertion Policies for High-Performance Caching


IPC Improvement

Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB, 16-way, LRU.

[Figure: IPC improvement with DIP (%) per benchmark]

DIP improves IPC by 9.3% on average

Page 20: Adaptive Insertion Policies for High-Performance Caching


Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

Page 21: Adaptive Insertion Policies for High-Performance Caching


Summary

LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction

The proposed change to the cache insertion policy (DIP) has:

1. Low hardware overhead: requires < two bytes of storage overhead
2. Low complexity: trivial to implement, no changes to cache structure
3. High performance: reduces misses by 21%; two-thirds as good as OPT
4. Robust across workloads: almost as good as LRU for LRU-friendly workloads

Page 22: Adaptive Insertion Policies for High-Performance Caching


Questions

Source code: www.ece.utexas.edu/~qk/dip

Page 23: Adaptive Insertion Policies for High-Performance Caching


DIP vs. LRU Across Cache Sizes

[Figure: MPKI relative to 1MB LRU (%), smaller is better, for LRU and DIP at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, equake, swim, health, Avg_16]

MPKI decreases until the workload fits in the cache

Page 24: Adaptive Insertion Policies for High-Performance Caching


DIP with 1MB 8-way L2 Cache

MPKI reduction with 8-way (19%) similar to 16-way (21%)

[Figure: % reduction in L2 MPKI per benchmark with a 1MB 8-way L2]

Page 25: Adaptive Insertion Policies for High-Performance Caching

Interaction with Prefetching

[Figure: % reduction in L2 MPKI for DIP-NoPref, LRU-Pref, and DIP-Pref]

DIP also works well in the presence of prefetching (PC-based stride prefetcher)

Page 26: Adaptive Insertion Policies for High-Performance Caching


mcf snippet

Page 27: Adaptive Insertion Policies for High-Performance Caching


art snippet

Page 28: Adaptive Insertion Policies for High-Performance Caching


health mpki

Page 29: Adaptive Insertion Policies for High-Performance Caching


swim mpki

Page 30: Adaptive Insertion Policies for High-Performance Caching


DIP Bypass

Page 31: Adaptive Insertion Policies for High-Performance Caching


DIP (design and implementation)

Page 32: Adaptive Insertion Policies for High-Performance Caching


Random Replacement (Success Function)

Cache contains K blocks; the cyclic reference stream contains T blocks

Probability that a block in the cache survives one eviction = (1 − 1/K)
Number of evictions between two consecutive references to the same block = (T − 1) · Pmiss

Phit = (1 − 1/K)^((T−1) · Pmiss)
Phit = (1 − 1/K)^((T−1)(1 − Phit))

Iterative solution, starting at Phit = 0:
1. Phit = (1 − 1/K)^(T−1)
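The fixed point is easy to compute numerically; this small sketch iterates the recurrence, with K and T chosen as example values (not from the paper):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double K = 1024.0;   // cache blocks (example value)
    const double T = 2048.0;   // blocks in the cyclic stream (example)
    double phit = 0.0;         // start at Phit = 0
    for (int i = 0; i < 1000; ++i) {
        // Phit = (1 - 1/K)^((T-1) * (1 - Phit))
        double next = std::pow(1.0 - 1.0 / K, (T - 1.0) * (1.0 - phit));
        if (std::fabs(next - phit) < 1e-12) { phit = next; break; }
        phit = next;
    }
    std::printf("Phit ~= %.6f\n", phit);
}
```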