Adaptive Insertion Policies for High-Performance Caching


1

Adaptive Insertion Policies for High-Performance Caching

Moinuddin K. Qureshi, Yale N. Patt
Aamer Jaleel, Simon C. Steely Jr., Joel Emer

International Symposium on Computer Architecture (ISCA) 2007

2

Background

L1 misses: short latency, can be hidden

L2 misses: long latency, hurt performance

Important to reduce last-level (L2) cache misses

Fast processor + slow memory → cache hierarchy:
Proc → L1 (~2 cycles) → L2 (~10 cycles) → Memory (~300 cycles)

3

Motivation

L1 for latency, L2 for capacity

Traditionally L2 is managed like L1 (typically LRU)

L1 filters temporal locality → poor locality at L2

LRU causes thrashing when working set > cache size: most lines remain unused between insertion and eviction

4

Dead on Arrival (DoA) Lines

DoA Lines: Lines unused between insertion and eviction

For the 1MB 16-way L2, 60% of lines are DoA

Ineffective use of cache space

[Figure: DoA lines (%) per benchmark]

5

Why DoA Lines?

Streaming data: never reused, so L2 caches don’t help

Working set of application greater than cache size

Solution: if working set > cache size, retain part of the working set

[Figures: art and mcf, misses per 1000 instructions vs. cache size in MB]

6

Overview

Problem: LRU replacement inefficient for L2 caches

Goal: A replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads

Proposal: A mechanism that reduces misses by 21% with total storage overhead < two bytes

7

Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

8

Cache Insertion Policy

Simple changes to insertion policy can greatly improve cache performance for memory-intensive workloads

Two components of cache replacement:

1. Victim Selection: Which line to replace for incoming line? (E.g. LRU, Random, FIFO, LFU)

2. Insertion Policy: Where is incoming line placed in replacement list? (E.g. insert incoming line at MRU position)
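As a concrete illustration of the split, here is a minimal Python sketch (an assumed implementation, not the paper's) of one cache set's recency stack, with victim selection and insertion as separate hooks:

class ReplacementStack:
    """One cache set's recency list: index 0 = MRU, last index = LRU."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []              # line tags in MRU -> LRU order

    def victim(self):
        # Component 1, victim selection: LRU picks the line at the tail.
        return self.stack[-1] if len(self.stack) == self.ways else None

    def insert(self, tag, at_mru=True):
        # Component 2, insertion policy: where the incoming line is placed.
        if len(self.stack) == self.ways:
            self.stack.pop()             # evict the victim chosen above
        if at_mru:
            self.stack.insert(0, tag)    # e.g. traditional MRU insertion
        else:
            self.stack.append(tag)       # e.g. insertion at the LRU position

    def touch(self, tag):
        # On a hit, promote the line to MRU.
        self.stack.remove(tag)
        self.stack.insert(0, tag)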

9

LRU-Insertion Policy (LIP)

MRU → LRU: a b c d e f g h

Reference to ‘i’ with traditional LRU policy: i a b c d e f g

Reference to ‘i’ with LIP: a b c d e f g i

Choose victim. Do NOT promote to MRU

Lines do not enter non-LRU positions unless reused
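A tiny standalone check (hypothetical code mirroring the slide's example) that reproduces both orderings:

def miss_insert(stack, tag, at_mru):
    # Victim selection: evict the LRU line at the tail, then insert.
    stack = stack[:-1]
    return [tag] + stack if at_mru else stack + [tag]

set_state = list("abcdefgh")                       # MRU ... LRU
print(miss_insert(set_state, "i", at_mru=True))    # LRU: i a b c d e f g
print(miss_insert(set_state, "i", at_mru=False))   # LIP: a b c d e f g i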

10

Bimodal-Insertion Policy (BIP)

LIP does not age older lines

Infrequently insert lines in the MRU position:

if (rand() < ε)
    insert at MRU position;
else
    insert at LRU position;

Let ε = bimodal throttle parameter

For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set
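A runnable version of the pseudocode above, assuming a Python list as the recency stack; ε = 1/32 is the value evaluated in the results that follow:

import random

EPSILON = 1 / 32   # bimodal throttle parameter

def bip_insert(stack, tag):
    """Evict the LRU victim, then insert at MRU only with probability ε."""
    stack = stack[:-1]
    if random.random() < EPSILON:
        return [tag] + stack    # rare MRU insertion lets old lines age out
    return stack + [tag]        # otherwise behave exactly like LIP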

11

Circular Reference Model [Smith & Goodman, ISCA’84]

Reference stream has T blocks and repeats N times. Cache has K blocks (K < T and N >> T). Hit rates:

Policy          (a1 a2 … aT)^N   (b1 b2 … bT)^N
LRU             0                0
OPT             (K−1)/(T−1)      (K−1)/(T−1)
LIP             (K−1)/T          0
BIP (small ε)   ≈ (K−1)/T        ≈ (K−1)/T
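The LRU and LIP rows of the table can be verified with a short simulation (a sketch with assumed parameters K = 8, T = 12, N = 1000; not the paper's code):

K, T, N = 8, 12, 1000
stream = list(range(T)) * N          # (a1 a2 ... aT) repeated N times

def hit_rate(policy):
    stack, hits = [], 0              # MRU ... LRU
    for blk in stream:
        if blk in stack:
            stack.remove(blk)        # hit: promote to MRU
            stack.insert(0, blk)
            hits += 1
        else:
            if len(stack) == K:
                stack.pop()          # evict the LRU victim
            if policy == "LRU":
                stack.insert(0, blk)     # insert at MRU
            else:                        # "LIP"
                stack.append(blk)        # insert at LRU
    return hits / len(stream)

print(hit_rate("LRU"))   # 0.0: thrashing, as the table's LRU row predicts
print(hit_rate("LIP"))   # ~0.58, matching (K-1)/T = 7/12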

12

Results for LIP and BIP

Changes to the insertion policy increase misses for LRU-friendly workloads

[Figure: % reduction in L2 MPKI for LIP and BIP (ε = 1/32) per benchmark]

13

Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

14

Dynamic-Insertion Policy (DIP)

Two types of workloads: LRU-friendly or BIP-friendly

DIP can be implemented by:

1. Monitor both policies (LRU and BIP)

2. Choose the best-performing policy

3. Apply the best policy to the cache

Need a cost-effective implementation: “Set Dueling”

15

DIP via “Set Dueling”

Divide the cache into three:
– Dedicated LRU sets
– Dedicated BIP sets
– Follower sets (follow the winner of LRU vs. BIP)

n-bit saturating counter:
miss to LRU-sets: counter++
miss to BIP-sets: counter--

Counter decides policy for follower sets:
– MSB = 0: use LRU
– MSB = 1: use BIP

Monitor, choose, and apply using a single counter
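A minimal sketch of the selection logic, assuming a 10-bit saturating counter and a simple modulo mapping of dedicated sets (the counter width and the mapping are illustrative assumptions; only the counter mechanism itself is from the slide):

CTR_BITS = 10                        # assumed counter width
CTR_MAX = (1 << CTR_BITS) - 1
psel = CTR_MAX // 2                  # saturating policy-selection counter

def set_type(index):
    # Dedicate a few sets to each policy; the rest follow the winner.
    if index % 32 == 0:
        return "LRU"
    if index % 32 == 1:
        return "BIP"
    return "follower"

def on_miss(index):
    global psel
    kind = set_type(index)
    if kind == "LRU":
        psel = min(psel + 1, CTR_MAX)    # miss in an LRU set: counter++
    elif kind == "BIP":
        psel = max(psel - 1, 0)          # miss in a BIP set: counter--

def policy_for(index):
    kind = set_type(index)
    if kind != "follower":
        return kind                      # dedicated sets keep their policy
    # MSB = 0: LRU is winning; MSB = 1: BIP is winning.
    return "BIP" if psel >> (CTR_BITS - 1) else "LRU"

on_miss(0)              # one miss in a dedicated LRU set
print(policy_for(5))    # follower set now uses "BIP" (counter crossed its midpoint)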

16

Bounds on Dedicated Sets

How many dedicated sets are required for “Set Dueling”?

μLRU, σLRU, μBIP, σBIP = average misses and standard deviation for LRU and BIP

P(Best) = probability of selecting the best policy:

P(Best) = P(Z < r·√n)

where n = number of dedicated sets, Z is a standard Gaussian variable, and
r = |μLRU − μBIP| / √(σLRU² + σBIP²)

For the majority of workloads r > 0.2, so 32–64 dedicated sets are sufficient
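A quick stdlib-only check of the bound (the choices of n here are illustrative, using the slide's r = 0.2):

from math import erf, sqrt

def p_best(r, n):
    """P(Z < r*sqrt(n)) via the standard normal CDF."""
    return 0.5 * (1 + erf(r * sqrt(n) / sqrt(2)))

for n in (16, 32, 64):
    print(n, round(p_best(0.2, n), 3))   # 0.788, 0.871, 0.945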

17

Results for DIP

DIP reduces average MPKI by 21% and requires < two bytes of storage overhead

[Figure: % reduction in L2 MPKI for BIP and DIP (32 dedicated sets) per benchmark]

18

DIP vs. Other Policies

[Figure: % reduction in average L2 MPKI for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, OPT, and Double (2MB)]

DIP bridges two-thirds of the gap between LRU and OPT

19

IPC Improvement

Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB 16-way LRU

[Figure: IPC improvement with DIP (%) per benchmark]

DIP improves IPC by 9.3% on average

20

Outline

Introduction

Static Insertion Policies

Dynamic Insertion Policies

Summary

21

Summary

LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction

The proposed change to the cache insertion policy (DIP) has:
1. Low hardware overhead: requires < two bytes of storage overhead
2. Low complexity: trivial to implement, no changes to the cache structure
3. High performance: reduces misses by 21%, two-thirds as good as OPT
4. Robustness across workloads: almost as good as LRU for LRU-friendly workloads

22

Questions

Source code: www.ece.utexas.edu/~qk/dip

23

DIP vs. LRU Across Cache Sizes

[Figure: MPKI relative to 1MB LRU (%), smaller is better, for LRU and DIP at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, equake, swim, health, Avg_16]

MPKI reduces until the workload fits in the cache

24

DIP with 1MB 8-way L2 Cache

MPKI reduction with 8-way (19%) is similar to 16-way (21%)

[Figure: % reduction in L2 MPKI per benchmark]

25

Interaction with Prefetching

[Figure: % reduction in L2 MPKI for DIP-NoPref, LRU-Pref, and DIP-Pref]

DIP also works well in the presence of prefetching (PC-based stride prefetcher)

26

mcf snippet

27

art snippet

28

health mpki

29

swim mpki

30

DIP Bypass

31

DIP (design and implementation)

32

Random Replacement (Success Function)

Cache contains K blocks and the reference stream contains T blocks

Probability that a block in the cache survives one eviction = (1 − 1/K)

Total number of evictions = (T − 1) · P_miss

P_hit = (1 − 1/K)^((T−1)·P_miss) = (1 − 1/K)^((T−1)(1 − P_hit))

Iterative solution, starting at P_hit = 0:

1. P_hit = (1 − 1/K)^(T−1)
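The fixed point can be computed directly; a small sketch (the K and T values are made up for illustration):

def random_hit_rate(K, T, iters=50):
    """Iterate P_hit = (1 - 1/K)^((T-1)*(1 - P_hit)) to a fixed point."""
    p_hit = 0.0                          # start at P_hit = 0, as above
    for _ in range(iters):
        p_hit = (1 - 1 / K) ** ((T - 1) * (1 - p_hit))
    return p_hit

print(random_hit_rate(K=16, T=32))       # converges to ~0.20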
