
Page 1: Memory Hierarchy Design Adapted from CS 252 Spring 2006 UC Berkeley Copyright (C) 2006 UCB

Memory Hierarchy Design

Adapted from CS 252 Spring 2006 UC Berkeley. Copyright (C) 2006 UCB.

Page 2:

Since 1980, CPU has outpaced DRAM ...

[Figure: performance (1/latency, log scale from 10 to 1000) vs. year, 1980-2000. CPU improved 60% per year (2X in 1.5 years); DRAM improved 9% per year (2X in 10 years). The gap grew 50% per year.]

Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.

Page 3:

1977: DRAM faster than microprocessors

Apple ][ (1977)
Steve Wozniak, Steve Jobs
CPU: 1000 ns; DRAM: 400 ns

Page 4:

Levels of the Memory Hierarchy

Level         Capacity    Access Time            Cost                   Staging Xfer Unit              Managed by
Registers     100s bytes  <10s ns                --                     1-8 bytes (instr. operands)    program/compiler
Cache         K bytes     10-100 ns              1-0.1 cents/bit        8-128 bytes (blocks)           cache controller
Main Memory   M bytes     200 ns-500 ns          10^-4-10^-5 cents/bit  512-4K bytes (pages)           OS
Disk          G bytes     10 ms (10,000,000 ns)  10^-5-10^-6 cents/bit  Mbytes (files)                 user/operator
Tape          infinite    sec-min                10^-8 cents/bit        --                             --

Upper levels (toward the CPU) are faster; lower levels are larger.

Page 5:

Memory Hierarchy: Apple iMac G5

iMac G5, 1.6 GHz

              Reg     L1 Inst  L1 Data  L2      DRAM    Disk
Size          1K      64K      32K      512K    256M    80G
Latency       1 cy    3 cy     3 cy     11 cy   88 cy   10^7 cy
              0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms
Managed by    compiler  hardware  hardware  hardware  OS/hardware/application  OS/hardware/application

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

Goal: Illusion of large, fast, cheap memory

Page 6:

iMac’s PowerPC 970: All caches on-chip

[Die photo: registers (1K), L1 (64K instruction), L1 (32K data), 512K L2.]

Page 7:

Memory Hierarchy

• Provides (virtually) unlimited amounts of fast memory.
• Takes advantage of locality.
  – Principle of locality (p. 47): programs tend to reuse data and instructions they have used recently (temporal and spatial).
  – Most programs do not access all code or data uniformly.

Page 8:

The Principle of Locality

• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
• For the last 15 years, hardware has relied on locality for speed.

Locality is a property of programs that is exploited in machine design.
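Both kinds of locality can be made concrete with a toy model: a small fully associative cache with LRU replacement. The block size, capacity, and access patterns below are assumed purely for illustration.

```python
BLOCK_BYTES = 16   # bytes per cache block (assumed)
NUM_BLOCKS = 8     # blocks the toy cache can hold (assumed)

def count_misses(addresses):
    """Count misses for a tiny fully associative cache with LRU replacement."""
    lru, misses = [], 0            # list ordered from least to most recent
    for addr in addresses:
        block = addr // BLOCK_BYTES
        if block in lru:
            lru.remove(block)      # hit: refresh this block's recency
        else:
            misses += 1
            if len(lru) == NUM_BLOCKS:
                lru.pop(0)         # evict the least recently used block
        lru.append(block)
    return misses

# Sequential word-by-word walk: spatial locality, one miss per 16-byte block.
sequential_misses = count_misses(range(0, 1024, 4))        # 256 accesses
# Large-stride walk: every access touches a new block, so every access misses.
strided_misses = count_misses(range(0, 256 * 256, 256))    # 256 accesses
```

The sequential walk misses only once per block (64 misses in 256 accesses), while the strided walk misses on every access, which is why programs with good locality cache well.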

Page 9:

Programs with locality cache well ...

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

[Figure: memory address (one dot per access) vs. time, from Hatfield and Gerald's data. Regions of the plot illustrate spatial locality, temporal locality, and bad locality behavior.]

Page 10:

Multi-Level Memory Hierarchy

[Figure: multi-level hierarchy with typical values; levels closer to the CPU are smaller, faster, and more expensive per bit.]

Page 11:

Memory Hierarchy

• Goal: to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
• All data in one level are found in the level below. (The upper level is a subset of the lower level.)

Page 12:

Why Important?

• Microprocessor performance improved 35% per year until 1986, and 55% per year since 1987.

• Memory: 7% per year performance improvement in latency

• Processor-memory performance gap

Page 13:

Memory Hierarchy: Terminology

• Hit: the data appears in some block in the upper level (example: block X).
  – Hit rate: the fraction of memory accesses found in the upper level.
  – Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: the data needs to be retrieved from a block in the lower level (block Y).
  – Miss rate = 1 - (hit rate).
  – Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty (500 instructions on the 21264!)

[Figure: block X in the upper-level memory, block Y in the lower-level memory; blocks move to and from the processor.]

Page 14:

Cache Measures

• Hit rate: fraction of accesses found in that level.
  – Usually so high that we talk about the miss rate instead.
  – Miss rate fallacy: just as MIPS is a poor proxy for CPU performance, miss rate is a poor proxy for average memory access time.
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including the time to install it in the CPU's cache.
  – Access time: time to reach the lower level = f(latency to lower level).
  – Transfer time: time to transfer the block = f(bandwidth between upper and lower levels).
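The average memory-access time formula above can be written directly as code; the numbers in the example call are assumed for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle hit time, 2% miss rate, and 25-cycle miss penalty
example_amat = amat(1, 0.02, 25)   # 1.5 cycles on average
```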

Page 15:

Cache Performance

Memory stall cycles
  = Number of misses x Miss penalty
  = IC x (Misses / Instruction) x Miss penalty
  = IC x (Memory accesses / Instruction) x Miss rate x Miss penalty

Page 16:

Example

• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
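One way to work this example, assuming one instruction fetch per instruction plus 0.5 data accesses (since loads and stores are 50% of the instructions):

```python
cpi_perfect = 1.0                   # CPI when every access hits
accesses_per_instr = 1.0 + 0.5      # 1 instruction fetch + 0.5 data accesses
miss_rate = 0.02
miss_penalty = 25                   # clock cycles

# Memory stall cycles per instruction, then the CPI including stalls.
stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty   # 0.75
cpi_with_misses = cpi_perfect + stalls_per_instr                   # 1.75

# The all-hits machine is faster by the ratio of the CPIs.
speedup = cpi_with_misses / cpi_perfect                            # 1.75x
```

So the computer with no cache misses would be 1.75 times faster.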

Page 17:

4 Questions for Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level? (Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)

Page 18:

Block Placement

• Direct mapped: (block address) mod (number of blocks in cache)
• Fully associative: a block can be placed anywhere
• Set associative: (block address) mod (number of sets in cache)
  – n-way set associative: n blocks in a set
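The two mod mappings above can be sketched as follows; the cache sizes in the example are assumed.

```python
def direct_mapped_frame(block_addr, num_blocks):
    """Direct mapped: the block has exactly one candidate frame."""
    return block_addr % num_blocks

def set_assoc_set(block_addr, num_sets):
    """Set associative: the block maps to one set of n candidate frames."""
    return block_addr % num_sets

# In an 8-block cache, block 12 maps to frame 4 when direct mapped,
# and to set 0 when organized as 2-way set associative (4 sets).
frame = direct_mapped_frame(12, 8)   # 4
set_i = set_assoc_set(12, 4)         # 0
```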

Page 19:

Block Placement

Page 20:

Block Placement

• The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative.

Page 21:

Block Identification

A block address is divided into three fields:
• Tag: used to check all the blocks in the set
• Index: used to select the set
• Block offset: the address of the desired data within the block

Page 22:

DECstation 3100 cache

[Figure: 32-bit address (bit positions 31-0) split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2, selecting one of 16K entries), and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 16-bit tag, and 32 bits of data. Block size = one word.]

Page 23:

64KB Cache Using 4-Word Blocks

[Figure: 32-bit address split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4, selecting one of 4K entries), a 2-bit block offset (bits 3-2, selecting one of four 32-bit words through a mux), and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 16-bit tag, and 128 bits of data.]
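The address split for this cache can be sketched as bit-field extraction; the field widths follow the figure, and the example address is assumed.

```python
def split_address(addr):
    """Split a 32-bit address per the 64KB, 4-word-block cache above."""
    byte_offset  = addr & 0x3           # bits 1-0: byte within the word
    block_offset = (addr >> 2) & 0x3    # bits 3-2: word within the block
    index        = (addr >> 4) & 0xFFF  # bits 15-4: one of 4K entries
    tag          = addr >> 16           # bits 31-16: compared with stored tag
    return tag, index, block_offset, byte_offset

tag, index, block_offset, byte_offset = split_address(0x12345678)
```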

Page 24:

Block Size vs. Miss Rate

• Increasing the block size decreases the miss rate.
  – Instruction references have better spatial locality.
• The serious problem with just increasing the block size: the miss penalty increases.
  – Miss penalty: the time required to fetch the block from the next lower level of the hierarchy and load it into the cache.
  – Miss penalty = (latency to the first word) + (transfer time for the rest of the block)
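That last formula is easy to express as code; the latency and bandwidth numbers in the example call are assumed.

```python
def miss_penalty_cycles(block_words, first_word_latency, cycles_per_word):
    """Latency to the first word plus transfer time for the rest of the block."""
    return first_word_latency + (block_words - 1) * cycles_per_word

# e.g. a 4-word block, 40 cycles to the first word, 4 cycles per extra word
penalty = miss_penalty_cycles(4, 40, 4)   # 52 cycles
```

Doubling the block size to 8 words under the same assumptions raises the penalty to 68 cycles, which is why block size cannot simply be increased for free.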

Page 25:

Block Size vs. Miss Rate

[Figure: miss rate (0%-40%) vs. block size (16, 64, 256 bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB.]

Page 26:

Block Identification

• If the total cache size is kept the same, increasing associativity– increases the number of blocks per set– decreases the size of the index– increases the size of the tag

Page 27:

Block Replacement

• With a direct-mapped cache, only one block frame is checked for a hit, and only that block can be replaced.
• With fully associative or set-associative caches, the choices are:
  – Random
  – Least-recently used (LRU)
  – First in, first out (FIFO)
• See Figure 5.6 on p. 400.
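A minimal sketch of LRU for one set of a set-associative cache, using an OrderedDict to track recency; the 4-way default is an assumption.

```python
from collections import OrderedDict

class LRUSet:
    """One set of a set-associative cache with LRU replacement."""

    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()    # least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, fill the block, evicting the LRU."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # most recently used goes last
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)    # evict least recently used
        self.blocks[tag] = True
        return False
```

Random and FIFO differ only in the victim choice: FIFO would evict the oldest fill regardless of later hits, and random would pick a victim by chance.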

Page 28:

Write Strategy

• Write through
  – The information is written to both the block in the cache and the block in memory.
  – Easier to implement; the cache is always clean.
• Write back
  – The information is written only to the block in the cache. The modified block is written to main memory only when it is replaced.
  – Less memory bandwidth, less power.
• Write through with a write buffer

Page 29:

What happens on a write?

                                          Write-Through                Write-Back
Policy                                    Data written to the cache    Write data only to the cache;
                                          block is also written to     update the lower level when a
                                          lower-level memory           block falls out of the cache
Debug                                     Easy                         Hard
Do read misses produce writes?            No                           Yes
Do repeated writes make it to
the lower level?                          Yes                          No

Additional option: let writes to an un-cached address allocate a new cache line (“write-allocate”).

Page 30:

Write Buffers for Write-Through Caches

[Figure: Processor <-> Cache; Cache -> Write Buffer -> Lower-Level Memory. The write buffer holds data awaiting write-through to lower-level memory.]

Q. Why a write buffer?
A. So the CPU doesn’t stall.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer and send the read first.

Page 31:

Write Miss Policy

• Write allocate
  – The block is allocated on a write miss, followed by the write hit.
• No-write allocate
  – Write misses do not affect the cache; the block is modified only in memory.
  – Unusual.

Page 32:

Example

• Assume a fully associative write-back cache with many cache entries that starts empty. What are the number of hits and misses when using no-write allocate versus write allocate?

Write Mem[100];

Write Mem[100];

Read Mem[200];

Write Mem[200];

Write Mem[100];
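This example can be worked as a small simulation: a fully associative cache modeled as a set of resident block addresses, where reads always allocate and writes allocate only under write allocate.

```python
def run_trace(trace, write_allocate):
    """Return (hits, misses) for a fully associative cache that starts empty."""
    resident, hits, misses = set(), 0, 0
    for op, addr in trace:
        if addr in resident:
            hits += 1
        else:
            misses += 1
            if op == "R" or write_allocate:
                resident.add(addr)   # reads always allocate; writes only if write-allocate
    return hits, misses

trace = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
no_write_allocate = run_trace(trace, write_allocate=False)   # 1 hit, 4 misses
with_write_allocate = run_trace(trace, write_allocate=True)  # 3 hits, 2 misses
```

Under no-write allocate, both the repeated writes to Mem[100] miss because the block is never brought into the cache; under write allocate, only the first touch of each address misses.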

Page 33:

Write Miss Policy

• Write-back caches normally use write allocate.
• Write-through caches often use no-write allocate.

Page 34:

Example

• Assume a cache of 16K blocks (1 block = 1 word), 32-bit addressing, and a byte-addressable machine. Find the total number of sets and the total number of tag bits for caches that are direct-mapped, 2-way set-associative, 4-way set-associative, and fully associative.
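One way to work this example: with 1-word (4-byte) blocks there are 2 offset bits, the index width is log2 of the number of sets, and the total tag storage is the tag bits per block times the number of blocks.

```python
import math

BLOCKS = 16 * 1024      # 16K blocks
ADDR_BITS = 32
OFFSET_BITS = 2         # 1 block = 1 word = 4 bytes, byte-addressable

def sets_and_tag_bits(ways):
    """Number of sets and total tag storage (bits) for a given associativity."""
    sets = BLOCKS // ways
    index_bits = int(math.log2(sets))
    tag_bits_per_block = ADDR_BITS - OFFSET_BITS - index_bits
    return sets, tag_bits_per_block * BLOCKS

direct_mapped = sets_and_tag_bits(1)        # 16384 sets, 16-bit tags
two_way       = sets_and_tag_bits(2)        # 8192 sets, 17-bit tags
four_way      = sets_and_tag_bits(4)        # 4096 sets, 18-bit tags
fully_assoc   = sets_and_tag_bits(BLOCKS)   # 1 set, 30-bit tags
```

Note how total tag storage grows with associativity: fewer sets mean fewer index bits, so each stored tag must be wider.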

Page 35:

Cache Performance

Average memory access time = Hit time + Miss rate x Miss penalty