
Adaptive Insertion Policies for High-Performance Caching


Page 1: Adaptive Insertion Policies for High-Performance Caching

Adaptive Insertion Policies for High-Performance Caching

Moinuddin K. Qureshi    Yale N. Patt
Aamer Jaleel    Simon C. Steely Jr.    Joel Emer

International Symposium on Computer Architecture (ISCA) 2007

Page 2: Adaptive Insertion Policies for High-Performance Caching


Background

L1 misses: short latency, can be hidden

L2 misses: long latency, hurt performance

Important to reduce last-level (L2) cache misses

Fast processor + slow memory → cache hierarchy

[Diagram: Proc → L1 (~2 cycles) → L2 (~10 cycles) → Memory (~300 cycles); an L2 miss goes to memory]

Page 3: Adaptive Insertion Policies for High-Performance Caching


Motivation

L1 for latency, L2 for capacity

Traditionally, L2 is managed like L1 (typically LRU)

L1 filters temporal locality → poor locality at L2

LRU causes thrashing when working set > cache size: most lines remain unused between insertion and eviction

Page 4: Adaptive Insertion Policies for High-Performance Caching


Dead on Arrival (DoA) Lines

DoA Lines: Lines unused between insertion and eviction

For the 1MB 16-way L2, 60% of lines are DoA

Ineffective use of cache space

[Figure: DoA lines (%) per benchmark]

Page 5: Adaptive Insertion Policies for High-Performance Caching

Why DoA Lines?

Streaming data → never reused. L2 caches don’t help.

Working set of application greater than cache size

Soln: if working set > cache size, retain some of the working set

[Figure: misses per 1000 instructions vs. cache size (MB) for art and mcf]

Page 6: Adaptive Insertion Policies for High-Performance Caching


Overview

Problem: LRU replacement inefficient for L2 caches

Goal: A replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads

Proposal: A mechanism that reduces misses by 21% and has total storage overhead < two bytes

Page 7: Adaptive Insertion Policies for High-Performance Caching


Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

Page 8: Adaptive Insertion Policies for High-Performance Caching


Cache Insertion Policy

Simple changes to insertion policy can greatly improve cache performance for memory-intensive workloads

Two components of cache replacement:

1. Victim Selection: Which line to replace for incoming line? (E.g. LRU, Random, FIFO, LFU)

2. Insertion Policy: Where is incoming line placed in replacement list? (E.g. insert incoming line at MRU position)

Page 9: Adaptive Insertion Policies for High-Performance Caching


LRU-Insertion Policy (LIP)

Recency stack (MRU → LRU): a b c d e f g h

Reference to ‘i’ with traditional LRU policy:
i a b c d e f g

Reference to ‘i’ with LIP:
a b c d e f g i

Choose victim. Do NOT promote to MRU.

Lines do not enter non-LRU positions unless reused
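Below is a minimal sketch of one set’s recency stack (a plain C++ model written for this transcript, not the paper’s released source): victim selection is identical under LRU and LIP, and only the insertion position of the incoming line changes.

```cpp
#include <cstdint>
#include <list>

// One cache set modeled as a recency stack: front = MRU, back = LRU.
struct CacheSet {
    std::list<uint64_t> stack;  // line tags, most-recently-used at front
    size_t ways;

    explicit CacheSet(size_t w) : ways(w) {}

    // Returns true on a hit. lipInsert selects the insertion policy:
    // false = traditional LRU insertion (MRU), true = LIP (LRU position).
    bool access(uint64_t tag, bool lipInsert) {
        for (auto it = stack.begin(); it != stack.end(); ++it) {
            if (*it == tag) {           // hit: promote to MRU on reuse
                stack.erase(it);
                stack.push_front(tag);
                return true;
            }
        }
        if (stack.size() == ways)       // miss in a full set:
            stack.pop_back();           // victim is always the LRU line
        if (lipInsert)
            stack.push_back(tag);       // LIP: do NOT promote to MRU
        else
            stack.push_front(tag);      // LRU: insert at MRU
        return false;
    }
};
```

With lipInsert = true, an inserted line is evicted on the very next miss unless it is reused first, so a streaming workload can no longer flush the whole set.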

Page 10: Adaptive Insertion Policies for High-Performance Caching

Bimodal-Insertion Policy (BIP)

LIP does not age older lines

Infrequently insert lines in MRU position:

if (rand() < ε)
    insert at MRU position;
else
    insert at LRU position;

ε = bimodal throttle parameter

For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set
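A sketch of BIP on top of the CacheSet model above; ε = 1/32 matches the value used in the results slide, while the RNG choice and seed are arbitrary.

```cpp
#include <random>

// BIP: with probability eps insert at MRU, otherwise insert at LRU
// (i.e., behave like LIP). Uses the CacheSet sketch from the LIP slide.
bool accessBIP(CacheSet& set, uint64_t tag, double eps = 1.0 / 32.0) {
    static std::mt19937 rng{12345};  // arbitrary fixed seed
    static std::uniform_real_distribution<double> coin(0.0, 1.0);
    bool lipInsert = coin(rng) >= eps;  // rare MRU insertion ages old lines
    return set.access(tag, lipInsert);
}
```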

Page 11: Adaptive Insertion Policies for High-Performance Caching

Circular Reference Model [Smith & Goodman, ISCA ’84]

Reference stream has T blocks and repeats N times. Cache has K blocks (K < T and N >> T).

Hit rate under each policy:

Policy          (a1 a2 a3 … aT)^N    (b1 b2 b3 … bT)^N
LRU             0                    0
OPT             (K−1)/(T−1)          (K−1)/(T−1)
LIP             (K−1)/T              0
BIP (small ε)   ≈ (K−1)/T            ≈ (K−1)/T

(LIP keeps K−1 blocks of the working set resident, hitting on K−1 of every T references; LRU evicts every block before its reuse.)
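The LIP and LRU rows can be checked empirically by replaying a cyclic stream through the CacheSet sketch from the LIP slide; K = 8, T = 12, N = 1000 below are example values, not from the paper.

```cpp
#include <cstdio>

int main() {
    const uint64_t T = 12;      // distinct blocks in the cyclic stream
    const int N = 1000;         // number of passes
    CacheSet lru(8), lip(8);    // K = 8 ways each
    long hitsLRU = 0, hitsLIP = 0, refs = 0;
    for (int n = 0; n < N; ++n)
        for (uint64_t t = 0; t < T; ++t, ++refs) {
            hitsLRU += lru.access(t, /*lipInsert=*/false);
            hitsLIP += lip.access(t, /*lipInsert=*/true);
        }
    // Model predicts: LRU -> 0, LIP -> (K-1)/T = 7/12 ~ 0.583
    std::printf("LRU hit rate %.3f, LIP hit rate %.3f\n",
                (double)hitsLRU / refs, (double)hitsLIP / refs);
}
```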

Page 12: Adaptive Insertion Policies for High-Performance Caching


Results for LIP and BIP

Changes to the insertion policy increase misses for LRU-friendly workloads

[Figure: % reduction in L2 MPKI for LIP and BIP (ε = 1/32) per benchmark]

Page 13: Adaptive Insertion Policies for High-Performance Caching


Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

Page 14: Adaptive Insertion Policies for High-Performance Caching


Dynamic-Insertion Policy (DIP)

Two types of workloads: LRU-friendly or BIP-friendly

DIP can be implemented by:

1. Monitor both policies (LRU and BIP)

2. Choose the best-performing policy

3. Apply the best policy to the cache

Need a cost-effective implementation: “Set Dueling”

Page 15: Adaptive Insertion Policies for High-Performance Caching

DIP via “Set Dueling”

Divide the cache in three:
– Dedicated LRU sets
– Dedicated BIP sets
– Follower sets (follow the winner of LRU vs. BIP)

n-bit saturating counter:
misses to LRU-sets: counter++
misses to BIP-sets: counter--

Counter decides policy for follower sets:
– MSB = 0: use LRU
– MSB = 1: use BIP

[Diagram: misses to LRU-sets and BIP-sets drive the n-bit counter up and down; MSB = 0 → use LRU, MSB = 1 → use BIP. Monitor → choose → apply, using a single counter]
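A sketch of the set-dueling logic, assuming 1024 sets, a 10-bit saturating counter, and modulo sampling to pick 32 dedicated sets per policy; the constants and the sampling scheme are illustrative choices, not the paper’s exact design.

```cpp
#include <cstdint>

constexpr uint32_t kPselBits = 10;                  // assumed counter width
constexpr uint32_t kPselMax  = (1u << kPselBits) - 1;
uint32_t psel = kPselMax / 2;                       // n-bit saturating counter

// Modulo sampling: with 1024 sets this dedicates 32 sets to each policy.
bool isLRUSet(uint32_t set) { return set % 32 == 0; }
bool isBIPSet(uint32_t set) { return set % 32 == 1; }

// Called on every miss: only dedicated-set misses move the counter.
void onMiss(uint32_t set) {
    if (isLRUSet(set) && psel < kPselMax) ++psel;   // LRU is missing
    else if (isBIPSet(set) && psel > 0)   --psel;   // BIP is missing
}

// Insertion policy per set: dedicated sets always follow their own
// policy; follower sets follow the MSB of the counter.
bool useBIP(uint32_t set) {
    if (isLRUSet(set)) return false;
    if (isBIPSet(set)) return true;
    return (psel >> (kPselBits - 1)) & 1;           // MSB = 1 -> use BIP
}
```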

Page 16: Adaptive Insertion Policies for High-Performance Caching


Bounds on Dedicated Sets

How many dedicated sets required for “Set Dueling”?

μLRU, σLRU, μBIP, σBIP = Avg. misses and stdev. for LRU and BIP

P(Best) = probability of selecting best policy

P(Best) = P(Z < r·√n), where:
n = number of dedicated sets
Z = standard Gaussian variable
r = |μLRU − μBIP| / √(σLRU² + σBIP²)

For the majority of workloads r > 0.2 → 32–64 dedicated sets are sufficient
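Plugging in r = 0.2 (the bound quoted above) shows why 32–64 dedicated sets suffice; Φ is the standard normal CDF:

```latex
P(\text{Best}) = \Phi(r\sqrt{n}):\qquad
\Phi(0.2\sqrt{32}) = \Phi(1.13) \approx 0.87, \qquad
\Phi(0.2\sqrt{64}) = \Phi(1.60) \approx 0.95
```

So with 32 to 64 dedicated sets, set dueling picks the better policy with probability roughly 0.87 to 0.95.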

Page 17: Adaptive Insertion Policies for High-Performance Caching


Results for DIP

DIP reduces average MPKI by 21% and requires < two bytes of storage overhead

[Figure: % reduction in L2 MPKI for BIP and DIP (32 dedicated sets) per benchmark]

Page 18: Adaptive Insertion Policies for High-Performance Caching


DIP vs. Other Policies

[Figure: % reduction in average L2 MPKI for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, OPT, and Double (2MB)]

DIP bridges two-thirds of gap between LRU and OPT


Page 19: Adaptive Insertion Policies for High-Performance Caching


IPC Improvement

Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB, 16-way, LRU.

[Figure: IPC improvement with DIP (%) per benchmark]

DIP improves IPC by 9.3% on average

Page 20: Adaptive Insertion Policies for High-Performance Caching


Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

Page 21: Adaptive Insertion Policies for High-Performance Caching


Summary

LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction

The proposed change to the cache insertion policy (DIP) has:

1. Low hardware overhead: requires < two bytes of storage overhead
2. Low complexity: trivial to implement, no changes to cache structure
3. High performance: reduces misses by 21%; two-thirds as good as OPT
4. Robust across workloads: almost as good as LRU for LRU-friendly workloads

Page 22: Adaptive Insertion Policies for High-Performance Caching


Questions

Source code: www.ece.utexas.edu/~qk/dip

Page 23: Adaptive Insertion Policies for High-Performance Caching


DIP vs. LRU Across Cache Sizes

[Figure: MPKI relative to 1MB LRU (%), smaller is better, for LRU and DIP at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, equake, swim, health, Avg_16]

MPKI decreases until the workload fits in the cache

Page 24: Adaptive Insertion Policies for High-Performance Caching


DIP with 1MB 8-way L2 Cache

MPKI reduction with 8-way (19%) similar to 16-way (21%)

[Figure: % reduction in L2 MPKI per benchmark with a 1MB 8-way L2]

Page 25: Adaptive Insertion Policies for High-Performance Caching

Interaction with Prefetching

[Figure: % reduction in L2 MPKI for DIP-NoPref, LRU-Pref, and DIP-Pref]

DIP also works well in the presence of prefetching (PC-based stride prefetcher)

Page 26: Adaptive Insertion Policies for High-Performance Caching


mcf snippet

Page 27: Adaptive Insertion Policies for High-Performance Caching


art snippet

Page 28: Adaptive Insertion Policies for High-Performance Caching


health mpki

Page 29: Adaptive Insertion Policies for High-Performance Caching


swim mpki

Page 30: Adaptive Insertion Policies for High-Performance Caching


DIP Bypass

Page 31: Adaptive Insertion Policies for High-Performance Caching


DIP (design and implementation)

Page 32: Adaptive Insertion Policies for High-Performance Caching


Random Replacement (Success Function)

Cache contains K blocks; the cyclic reference stream contains T blocks

Probability that a block in the cache survives one eviction = (1 − 1/K)
Number of evictions between two consecutive references to the same block = (T − 1) · Pmiss

Phit = (1 − 1/K)^((T−1) · Pmiss)
Phit = (1 − 1/K)^((T−1)(1 − Phit))

Iterative solution, starting at Phit = 0:
1. Phit = (1 − 1/K)^(T−1)
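The fixed point is easy to compute numerically; this small sketch iterates the recurrence, with K and T chosen as example values (not from the paper):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double K = 1024.0;   // cache blocks (example value)
    const double T = 2048.0;   // blocks in the cyclic stream (example)
    double phit = 0.0;         // start at Phit = 0
    for (int i = 0; i < 1000; ++i) {
        // Phit = (1 - 1/K)^((T-1) * (1 - Phit))
        double next = std::pow(1.0 - 1.0 / K, (T - 1.0) * (1.0 - phit));
        if (std::fabs(next - phit) < 1e-12) { phit = next; break; }
        phit = next;
    }
    std::printf("Phit ~= %.6f\n", phit);
}
```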