Adaptive Insertion Policies for High-Performance Caching


1

Adaptive Insertion Policies for High-Performance Caching

Moinuddin K. Qureshi, Yale N. Patt
Aamer Jaleel, Simon C. Steely Jr., Joel Emer

International Symposium on Computer Architecture (ISCA) 2007

2

Background

L1 misses: short latency, can be hidden

L2 misses: long latency, hurt performance

Important to reduce last-level (L2) cache misses

Fast processor + slow memory → cache hierarchy:
Proc → L1 (~2 cycles) → L2 (~10 cycles) → Memory (~300 cycles)

3

Motivation

L1 for latency, L2 for capacity

Traditionally L2 is managed like L1 (typically LRU)

L1 filters temporal locality → poor locality at L2

LRU causes thrashing when working set > cache size: most lines remain unused between insertion and eviction

4

Dead on Arrival (DoA) Lines

DoA Lines: Lines unused between insertion and eviction

For the 1MB 16-way L2, 60% of lines are DoA

Ineffective use of cache space

[Figure: DoA lines (%) per benchmark]

5

Why DoA Lines?

Streaming data: never reused, so L2 caches don’t help

Working set of application greater than cache size

Solution: if working set > cache size, retain part of the working set

[Figures: art and mcf, misses per 1000 instructions vs. cache size in MB]

6

Overview

Problem: LRU replacement inefficient for L2 caches

Goal: A replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads

Proposal: A mechanism that reduces misses by 21% with total storage overhead < two bytes

7

Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

8

Cache Insertion Policy

Simple changes to insertion policy can greatly improve cache performance for memory-intensive workloads

Two components of cache replacement:

1. Victim Selection: Which line to replace for incoming line? (E.g. LRU, Random, FIFO, LFU)

2. Insertion Policy: Where is incoming line placed in replacement list? (E.g. insert incoming line at MRU position)
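As a concrete illustration of the split, here is a minimal Python sketch (an assumed implementation, not the paper's) of one cache set's recency stack, with victim selection and insertion as separate hooks:

class ReplacementStack:
    """One cache set's recency list: index 0 = MRU, last index = LRU."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []              # line tags in MRU -> LRU order

    def victim(self):
        # Component 1, victim selection: LRU picks the line at the tail.
        return self.stack[-1] if len(self.stack) == self.ways else None

    def insert(self, tag, at_mru=True):
        # Component 2, insertion policy: where the incoming line is placed.
        if len(self.stack) == self.ways:
            self.stack.pop()             # evict the victim chosen above
        if at_mru:
            self.stack.insert(0, tag)    # e.g. traditional MRU insertion
        else:
            self.stack.append(tag)       # e.g. insertion at the LRU position

    def touch(self, tag):
        # On a hit, promote the line to MRU.
        self.stack.remove(tag)
        self.stack.insert(0, tag)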

9

LRU-Insertion Policy (LIP)

MRU → LRU: a b c d e f g h

Reference to ‘i’ with traditional LRU policy: i a b c d e f g

Reference to ‘i’ with LIP: a b c d e f g i

Choose victim. Do NOT promote to MRU

Lines do not enter non-LRU positions unless reused
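A tiny standalone check (hypothetical code mirroring the slide's example) that reproduces both orderings:

def miss_insert(stack, tag, at_mru):
    # Victim selection: evict the LRU line at the tail, then insert.
    stack = stack[:-1]
    return [tag] + stack if at_mru else stack + [tag]

set_state = list("abcdefgh")                       # MRU ... LRU
print(miss_insert(set_state, "i", at_mru=True))    # LRU: i a b c d e f g
print(miss_insert(set_state, "i", at_mru=False))   # LIP: a b c d e f g i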

10

Bimodal-Insertion Policy (BIP)

LIP does not age older lines

Infrequently insert lines in the MRU position:

if (rand() < ε)
    insert at MRU position;
else
    insert at LRU position;

Let ε = bimodal throttle parameter

For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set
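A runnable version of the pseudocode above, assuming a Python list as the recency stack; ε = 1/32 is the value evaluated in the results that follow:

import random

EPSILON = 1 / 32   # bimodal throttle parameter

def bip_insert(stack, tag):
    """Evict the LRU victim, then insert at MRU only with probability ε."""
    stack = stack[:-1]
    if random.random() < EPSILON:
        return [tag] + stack    # rare MRU insertion lets old lines age out
    return stack + [tag]        # otherwise behave exactly like LIP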

11

Circular Reference Model [Smith & Goodman, ISCA’84]

Reference stream has T blocks and repeats N times. Cache has K blocks (K < T and N >> T). Hit rates:

Policy          (a1 a2 … aT)^N   (b1 b2 … bT)^N
LRU             0                0
OPT             (K−1)/(T−1)      (K−1)/(T−1)
LIP             (K−1)/T          0
BIP (small ε)   ≈ (K−1)/T        ≈ (K−1)/T
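The LRU and LIP rows of the table can be verified with a short simulation (a sketch with assumed parameters K = 8, T = 12, N = 1000; not the paper's code):

K, T, N = 8, 12, 1000
stream = list(range(T)) * N          # (a1 a2 ... aT) repeated N times

def hit_rate(policy):
    stack, hits = [], 0              # MRU ... LRU
    for blk in stream:
        if blk in stack:
            stack.remove(blk)        # hit: promote to MRU
            stack.insert(0, blk)
            hits += 1
        else:
            if len(stack) == K:
                stack.pop()          # evict the LRU victim
            if policy == "LRU":
                stack.insert(0, blk)     # insert at MRU
            else:                        # "LIP"
                stack.append(blk)        # insert at LRU
    return hits / len(stream)

print(hit_rate("LRU"))   # 0.0: thrashing, as the table's LRU row predicts
print(hit_rate("LIP"))   # ~0.58, matching (K-1)/T = 7/12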

12

Results for LIP and BIP

Changes to the insertion policy increase misses for LRU-friendly workloads

[Figure: % reduction in L2 MPKI for LIP and BIP (ε = 1/32) per benchmark]

13

Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

14

Dynamic-Insertion Policy (DIP)

Two types of workloads: LRU-friendly or BIP-friendly

DIP can be implemented by:

1. Monitor both policies (LRU and BIP)

2. Choose the best-performing policy

3. Apply the best policy to the cache

Need a cost-effective implementation: “Set Dueling”

15

DIP via “Set Dueling”

Divide the cache into three:
– Dedicated LRU sets
– Dedicated BIP sets
– Follower sets (follow the winner of LRU vs. BIP)

n-bit saturating counter:
miss to LRU-sets: counter++
miss to BIP-sets: counter--

Counter decides policy for follower sets:
– MSB = 0: use LRU
– MSB = 1: use BIP

Monitor, choose, and apply using a single counter
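A minimal sketch of the selection logic, assuming a 10-bit saturating counter and a simple modulo mapping of dedicated sets (the counter width and the mapping are illustrative assumptions; only the counter mechanism itself is from the slide):

CTR_BITS = 10                        # assumed counter width
CTR_MAX = (1 << CTR_BITS) - 1
psel = CTR_MAX // 2                  # saturating policy-selection counter

def set_type(index):
    # Dedicate a few sets to each policy; the rest follow the winner.
    if index % 32 == 0:
        return "LRU"
    if index % 32 == 1:
        return "BIP"
    return "follower"

def on_miss(index):
    global psel
    kind = set_type(index)
    if kind == "LRU":
        psel = min(psel + 1, CTR_MAX)    # miss in an LRU set: counter++
    elif kind == "BIP":
        psel = max(psel - 1, 0)          # miss in a BIP set: counter--

def policy_for(index):
    kind = set_type(index)
    if kind != "follower":
        return kind                      # dedicated sets keep their policy
    # MSB = 0: LRU is winning; MSB = 1: BIP is winning.
    return "BIP" if psel >> (CTR_BITS - 1) else "LRU"

on_miss(0)              # one miss in a dedicated LRU set
print(policy_for(5))    # follower set now uses "BIP" (counter crossed its midpoint)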

16

Bounds on Dedicated Sets

How many dedicated sets are required for “Set Dueling”?

μLRU, σLRU, μBIP, σBIP = average misses and standard deviation for LRU and BIP

P(Best) = probability of selecting the best policy:

P(Best) = P(Z < r·√n)

where n = number of dedicated sets, Z is a standard Gaussian variable, and
r = |μLRU − μBIP| / √(σLRU² + σBIP²)

For the majority of workloads r > 0.2, so 32–64 dedicated sets are sufficient
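A quick stdlib-only check of the bound (the choices of n here are illustrative, using the slide's r = 0.2):

from math import erf, sqrt

def p_best(r, n):
    """P(Z < r*sqrt(n)) via the standard normal CDF."""
    return 0.5 * (1 + erf(r * sqrt(n) / sqrt(2)))

for n in (16, 32, 64):
    print(n, round(p_best(0.2, n), 3))   # 0.788, 0.871, 0.945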

17

Results for DIP

DIP reduces average MPKI by 21% and requires < two bytes of storage overhead

[Figure: % reduction in L2 MPKI for BIP and DIP (32 dedicated sets) per benchmark]

18

DIP vs. Other Policies

[Figure: % reduction in average L2 MPKI for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, OPT, and Double (2MB)]

DIP bridges two-thirds of the gap between LRU and OPT

19

IPC Improvement

Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB 16-way LRU

[Figure: IPC improvement with DIP (%) per benchmark]

DIP improves IPC by 9.3% on average

20

Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

21

Summary

LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction

The proposed change to the cache insertion policy (DIP) has:
1. Low hardware overhead: requires < two bytes of storage overhead
2. Low complexity: trivial to implement, no changes to the cache structure
3. High performance: reduces misses by 21%, two-thirds as good as OPT
4. Robustness across workloads: almost as good as LRU for LRU-friendly workloads

22

Questions

Source code: www.ece.utexas.edu/~qk/dip

23

DIP vs. LRU Across Cache Sizes

[Figure: MPKI relative to 1MB LRU (%), smaller is better, for LRU and DIP at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, equake, swim, health, Avg_16]

MPKI reduces until the workload fits in the cache

24

DIP with 1MB 8-way L2 Cache

MPKI reduction with 8-way (19%) is similar to 16-way (21%)

[Figure: % reduction in L2 MPKI per benchmark]

25

Interaction with Prefetching

[Figure: % reduction in L2 MPKI for DIP-NoPref, LRU-Pref, and DIP-Pref]

DIP also works well in the presence of prefetching (PC-based stride prefetcher)

26

mcf snippet

27

art snippet

28

health mpki

29

swim mpki

30

DIP Bypass

31

DIP (design and implementation)

32

Random Replacement (Success Function)

Cache contains K blocks and the reference stream contains T blocks

Probability that a block in the cache survives one eviction = (1 − 1/K)

Total number of evictions = (T − 1) · P_miss

P_hit = (1 − 1/K)^((T−1)·P_miss) = (1 − 1/K)^((T−1)(1 − P_hit))

Iterative solution, starting at P_hit = 0:

1. P_hit = (1 − 1/K)^(T−1)
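The fixed point can be computed directly; a small sketch (the K and T values are made up for illustration):

def random_hit_rate(K, T, iters=50):
    """Iterate P_hit = (1 - 1/K)^((T-1)*(1 - P_hit)) to a fixed point."""
    p_hit = 0.0                          # start at P_hit = 0, as above
    for _ in range(iters):
        p_hit = (1 - 1 / K) ** ((T - 1) * (1 - p_hit))
    return p_hit

print(random_hit_rate(K=16, T=32))       # converges to ~0.20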
