Crictical Word First for Cache Misses

7/31/2019 Crictical Word First for Cache Misses

1/21

RHK.S96 1

Lecture 17: Memory Hierarchy

Five Ways to Reduce Miss Penalty(Second Level Cache)

Professor Randy H. Katz

Computer Science 252

Spring 1996


2/21

RHK.S96 2

Review: Summary

3 Cs: Compulsory, Capacity, Conflict Misses

Reducing Miss Rate1. Reduce Misses via Larger Block Size2. Reduce Conflict Misses via Higher Associativity

3. Reducing Conflict Misses via Victim Cache

4. Reducing Conflict Misses via Pseudo-Associativity

5. Reducing Misses by HW Prefetching Instr, Data

6. Reducing Misses by SW Prefetching Data

7. Reducing Capacity/Conf. Misses by Compiler Optimizations

Remember danger of concentrating on just oneparameter when evaluating performance

Today: reducing Miss penalty & Hit time

CPUtime = IC x (CPIexecution +

Mem Access per instruction x Miss Rate x Miss penalty) x clock cycle time


3/21

RHK.S96 3

1. Reducing Miss Penalty: Read

Priority over Write on Miss

Write back with write buffers offer RAW conflictswith main memory reads on cache misses

If simply wait for write buffer to empty mightincrease read miss penalty by 50% (old MIPS 1000)

Check write buffer contents before read;if no conflicts, let the memory access continue

Write Back? Read miss replacing dirty block

Normal: Write dirty block to memory, and then do the read

Instead copy the dirty block to a write buffer, then do the read,and then do the write

CPU stall less since restarts as soon as do read


4/21

RHK.S96 4

2. Subblock Placement to

Reduce Miss Penalty

Dont have to load full block on a miss

Have bits per subblock to indicate valid

(Originally invented to reduce tag storage)

Valid Bits


5/21

RHK.S96 5

3. Early Restart and Critical

Word First

Dont wait for full block to be loaded before restartingCPU

Early restartAs soon as the requested word of the block arrives,

send it to the CPU and let the CPU continue execution

Critical Word FirstRequest the missed word first from memoryand send it to the CPU as soon as it arrives; let the CPU continueexecution while filling the rest of the words in the block. Also calledwrapped fetchand requested word first

Generally useful only in large blocks,

Spatial locality a problem; tend to want nextsequential word, so not clear if benefit by early restart


6/21

RHK.S96 6

4. Non-blocking Caches to

reduce stalls on misses

Non-blocking cacheor lockup-free cacheallowing thedata cache to continue to supply cache hits during amiss

hit under miss reduces the effective miss penaltyby being helpful during a miss instead of ignoring therequests of the CPU

hit under multiple miss or miss under miss may

further lower the effective miss penalty by overlappingmultiple misses Significantly increases the complexity of the cache controller as

there can be multiple outstanding memory accesses


7/21

RHK.S96 7

Value of Hit Under Miss for SPEC

FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26

Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19

8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss

H i t U n d e r i M i s s e s

Avg.

Mem.

Access

Time

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

eqntott

espre

sso

xlisp

compr

ess

mdljs

p2

ear

fpp

pp

tomc

atv

swm2

56

do

duc

su2cor

wave5

mdljd

p2

hydro

2d

alv

inn

nasa7

spice2

g6

ora

0->1

1->2

2 ->64

Base

Integer Floating Point

Hit under n Misses


8/21

RHK.S96 8

5th Miss Penalty Reduction:

Second Level Cache L2 EquationsAMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1

Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss RateL2 +Miss PenaltyL2)

Definitions: Local miss rate misses in this cache divided by the total

number of memory accesses to this cache(Miss rateL2) Global miss ratemisses in this cache divided by the total

number of memory accesses generated by the CPU(Miss RateL1 x Miss RateL2)


9/21

RHK.S96 9

Comparing Local and GlobalMiss Rates

32 KByte 1st level cache;Increasing 2nd level cache

Global miss rate close tosingle level cache rateprovided L2 >> L1

Dont use local miss rate

L2 not tied to CPU clockcycle

Cost & A.M.A.T.

Generally Fast Hit Timesand fewer misses

Since hits are few, targetmiss reduction

Linear

Log

Cache Size

Cache Size


10/21

RHK.S96 10

Reducing Misses: Which apply

to L2 Cache?

Reducing Miss Rate1. Reduce Misses via Larger Block Size

2. Reduce Conflict Misses via Higher Associativity

3. Reducing Conflict Misses via Victim Cache

4. Reducing Conflict Misses via Pseudo-Associativity

5. Reducing Misses by HW Prefetching Instr, Data

6. Reducing Misses by SW Prefetching Data

7. Reducing Capacity/Conf. Misses by Compiler Optimizations


11/21

RHK.S96 11

Relat iv e CPU Tim e

Block S ize

1

1 .1

1 .2

1 .3

1 .4

1 .5

1 .6

1 .7

1 .8

1 .9

2

1 6 3 2 6 4 1 2 8 2 5 6 5 1 2

1 . 3 6

1 . 2 8 1 . 2 71 . 3 4

1 . 5 4

1 . 9 5

L2 cache block size & A.M.A.T.

32KB L1, 8 byte path to memory


12/21

RHK.S96 12

Reducing Miss Penalty Summary

Five techniques Read priority over write on miss

Subblock placement

Early Restart and Critical Word First on miss

Non-blocking Caches (Hit Under Miss)

Second Level Cache

Can be applied recursively to Multilevel Caches Danger is that time to DRAM will grow with multiple levels in

between


13/21

RHK.S96 13

Review: Improving Cache

Performance

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache.


14/21

RHK.S96 14

1. Fast Hit times via Small and

Simple Caches

Why Alpha 21164 has 8KB Instruction and8KB data cache + 96KB second level cache

Direct Mapped, on chip


15/21

RHK.S96 15

2. Fast hits by AvoidingAddress Translation

Send virtual address to cache? Called Virtually AddressedCacheor just Virtual Cachevs. Physical Cache Every time process is switched logically must flush the cache; otherwise

get false hits

Cost is time to flush + compulsory misses from empty cache

Dealing with aliases(sometimes called synonyms);Two different virtual addresses map to same physical address

I/O must interact with cache, so need virtual address

Solution to aliases HW that guarantees that every cache block has unique physical address

SW guarantee: lower n bits must have same address; as long as coversindex field & direct mapped, they must be unique;called page coloring

Solution to cache flush Addprocess identifier tagthat identifies process as well as address within

process: cant get a hit if wrong process


16/21

RHK.S96 16

2. Avoiding Translation:

Process ID impact Black is uniprocess

Light Gray is multiprocesswhen flush cache

Dark Gray is multiprocesswhen use Process ID tag

Y axis: Miss Rates up to 20%

X axis: Cache size from 2 KB

to 1024 KB


17/21

RHK.S96 17

2. Avoiding Translation: Index

with Physical Portion of Address

If index is physical part of address, can starttag access in parallel with translation so thatcan compare to physical tag

Limits cache to page size: what if want biggercaches and uses same trick? Higher associativity

Page coloring

Page Address Page Offset

Address Tag Index Block Offset


18/21

RHK.S96 18

Pipeline Tag Check and Update Cache as separate stages;current write tag check & previous write cache update

Only Write in the pipeline; empty during a miss

In color is Delayed Write Buffer; must be checked onreads; either complete write or read from buffer

3. Fast Hit Times Via Pipelined Writes


19/21

RHK.S96 19

4. Fast Writes on Misses Via

Small Subblocks

If most writes are 1 word, subblock size is 1 word, & writethrough then always write subblock & tag immediately

Tag match and valid bit already set: Writing the block was proper, &nothing lost by setting valid bit on again.

Tag match and valid bit not set: The tag match means that this is theproper block; writing the data into the subblock makes it appropriate toturn the valid bit on.

Tag mismatch: This is a miss and will modify the data portion of theblock. As this is a write-through cache, however, no harm was done;

memory still has an up-to-date copy of the old value. Only the tag to theaddress of the write and the valid bits of the other subblock need bechanged because the valid bit for this subblock has already been set

Doesnt work with write back due to last case


20/21

RHK.S96 20

Cache Optimization SummaryTechnique MR MP HT Complexity

Larger Block Size + 0Higher Associativity + 1Victim Caches + 2

Pseudo-Associative Caches + 2HW Prefetching of Instr/Data + 2Compiler Controlled Prefetching + 3Compiler Reduce Misses + 0

Priority to Read Misses + 1Subblock Placement + + 1Early Restart & Critical Word 1st + 2

Non-Blocking Caches + 3Second Level Caches + 2

Small & Simple Caches + 0Avoiding Address Translation + 2Pipelining Writes + 1


21/21

RHK.S96 21

What is the Impact of What

Youve Learned About Caches?

1960-1985: Speed= (no. operations)

1995

PipelinedExecution &Fast Clock Rate

Out-of-Ordercompletion

SuperscalarInstruction Issue

1995: Speed =(non-cached memory accesses)

What does this mean for

Compilers?,Operating Systems?, Algorithms? Data Structures?

1

10

100

1000

1

9

8

0

1

9

8

1

1

9

8

2

1

9

8

3

1

9

8

4

1

9

8

5

1

9

8

6

1

9

8

7

1

9

8

8

1

9

8

9

1

9

9

0

1

9

9

1

1

9

9

2

1

9

9

3

1

9

9

4

1

9

9

5

1

9

9

6

1

9

9

7

1

9

9

8

1

9

9

9

2

0

0

0

DRAM

CPU

Crictical Word First for Cache Misses

Documents

Accelerating MySQL with JIT Compilers...10 OLTP workloads are mostly front-end CPU stalls Instruction cache misses, branch mispredictions, ITLB misses Use profiling information to

Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Chapter Overview - HUJI CSEcomarc/slides/lect7-1.pdf · 2006-06-08 · Chapter Overview 5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty

Accelerating Dependent Cache Misses with an Enhanced Memory …omutlu/pub/enhanced-memory... · 2016-04-28 · Accelerating Dependent Cache Misses with an Enhanced Memory Controller

DataCollecting Performance Data PIwith I -C · PAPI_ L2_: el 1 data cache reads s misses hits s s RPAPI_ L2_R : dLel 1 ins t ruc ticac ha ds s misses hits s I Preset Events it s ¾d

COMP 206 Computer Architecture and Implementation Unit 8b: Cache Misses

Accelerating Dependent Cache Misses with an Enhanced ...hps.ece.utexas.edu/people/miladh/pub/emc_isca2016.pdf · Accelerating Dependent Cache Misses with an ... University of Texas

Cache Replacement Policies - ECE/CS 752 Fall 2017 · Cache Replacement Policies ... Replacement policies affect capacity and conflict misses ... used policies,” IEEE Trans.Computers,

Thread-Level Parallelism — Simultaneous MultiThreading ... › ~htseng › classes › cse203_2019fa › 15_SMTCMP.pdf$ SMT processors can better utilize hardware during cache misses

Performance of coherence protocols...Lecture 14 Architecture of Parallel Computers 1 Performance of coherence protocols Cache misses have traditionally been classified into four categories:

Cache Performance Metrics Miss Rate Fraction of memory references not found in cache (misses / accesses) = 1 – hit rate Typical numbers (in percentages):

Chapter Overviecomarc/slides/lect7-1.pdfChapter Overview 5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6

Managing Cache Coherency on Cortex-M7 Based MCUsww1.microchip.com/downloads/en/DeviceDoc/Managing-Cache-Cohe… · The cache hits only update the cache memory. Cache misses on a write,

CS252 Spring 2017 Graduate Computer Architecture Lecture ...inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec13.pdf1 Load request at head of CPU->Cache queue. Load misses in cache

Cache Performance forslingn/publications/mm_cache/m… · Cold start effects • All initial accesses are compulsory misses ... Irfanview v3.15 Image Viewer 1,189,234 PowerDVD v2.55

Types of Cache Misses: The Three C’s

1 Dynamic Memory Management. 2 What’s worse? 50% of instructions/data are cache misses (fetched from RAM) 1% of instructions/data are page faults

HPCA-15 :: Feb 18, 2009 iCFP: Tolerating All Level Cache Misses in In-Order Processors Andrew Hilton, Santosh Nagarakatte, Amir Roth University of Pennsylvania

Part 1: Introductiontinoosh/cmpe311/notes/multicore.pdf · ECE611 Course-Grained Multithreading Switches threads only on costly stalls, such as L2 cache misses Advantages Relieves

HACL: A Verified Modern Cryptographic Librarybranching, memory access patterns, cache hits and misses, power consumption, etc. Cryptographic Security Thecodeforeachcryptographiccon-struction