
Memory Hierarchy— Motivation, Definitions, Four Questions about Memory Hierarchy, Improving Performance

Professor Alvin R. Lebeck
Computer Science 220 / ECE 252
Fall 2006

Slide 2: Admin

• Some stuff will be review… some will not
• Homework #4
• Work on your projects
• Costi will lecture next Tuesday
• There will be two papers to read


Slide 3: Outline of Today's Lecture

• Most of today should be review… later stuff will not
• The Memory Hierarchy
• Direct Mapped Cache
• Two-Way Set Associative Cache
• Fully Associative Cache
• Replacement Policies
• Write Strategies
• Memory Hierarchy Performance
• Improving Performance

Slide 4: The Big Picture: Where Are We Now?

• The Five Classic Components of a Computer
• Today's Topic: Memory System

[Figure: the five classic components: a processor (control + datapath), memory, input, and output; today's focus is the memory system]


Slide 5: Who Cares About the Memory Hierarchy?

• Processor-only focus thus far in the course:
– CPU cost/performance, ISA, pipelined execution
• The CPU-DRAM performance gap keeps growing
• 1980: no cache in microprocessors; 1995: two-level caches, with 60% of the transistors on the Alpha 21164 microprocessor (and 150 clock cycles for a miss!)

[Figure: processor vs. DRAM performance, 1980 to 2000, on a log scale from 1 to 1000; CPU performance grows far faster than DRAM performance, so the gap widens every year]

Slide 6: The Motivation for Caches

• Motivation:
– Large memories (DRAM) are slow
– Small memories (SRAM) are fast
• Make the average access time small by servicing most accesses from a small, fast memory
• Reduce the bandwidth required of the large memory

[Figure: the processor talks to a memory system consisting of a cache backed by DRAM]


Slide 7: Levels of the Memory Hierarchy

Level          Capacity         Access Time     Cost
Registers      100s of bytes    < 1 ns
Cache          KB-MB            1-30 ns
Main Memory    GB               100 ns - 1 us   $250 per 2 GB (~10^-8 cents/bit)
Disk           40-200 GB        3-15 ms         $110 per 80 GB (~10^-10 cents/bit)
Tape           400+ GB          sec-min         $80 per 400 GB (~10^-11 cents/bit)

Staging/transfer unit between levels, and who manages the movement:

Registers <-> Cache: instruction operands (1-8 bytes), managed by the program/compiler
Cache <-> Memory: blocks (8-128 bytes), managed by the cache controller
Memory <-> Disk: pages (512 B - 4 KB), managed by the OS
Disk <-> Tape: files (MBs), managed by the user/operator

Upper levels are faster; lower levels are larger.

Slide 8: The Principle of Locality

• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time
– Example: 90% of time in 10% of the code
• Two different types of locality:
– Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon
– Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon

[Figure: probability of reference plotted across the address space from 0 to 2^n]
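To make this concrete (an illustrative example, not from the slides), the loop below exhibits both kinds of locality:

    /* Summing an array: `sum` and `i` are reused every iteration
     * (temporal locality), while a[0], a[1], ... sit at consecutive
     * addresses (spatial locality). */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }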


Slide 9: Memory Hierarchy: Principles of Operation

• At any given time, data is copied between only 2 adjacent levels:
– Upper level (cache): the one closer to the processor
» Smaller, faster, and uses more expensive technology
– Lower level (memory): the one further away from the processor
» Bigger, slower, and uses less expensive technology
• Block:
– The minimum unit of information that can either be present or not present in the two-level hierarchy

[Figure: blocks Blk X and Blk Y moving between the upper level (cache) and the lower level (memory), to and from the processor]

Slide 10: Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - Hit Rate
– Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty


Slide 11: Four Questions for Memory Hierarchy Designers

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Slide 12: Direct Mapped Cache

• A direct mapped cache is an array of fixed-size frames.
• Each frame holds consecutive bytes (a cache block) of main memory data.
• The tag array holds the block memory address.
• A valid bit associated with each cache block tells whether the data is valid.
• Cache Index: the location of a block (and its tag) in the cache.
• Block Offset: the byte location within the cache block.

Cache-Index = (<Address> mod Cache_Size) / Block_Size
Block-Offset = <Address> mod Block_Size
Tag = <Address> / Cache_Size
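These formulas translate directly into C. A minimal sketch (assumes byte addresses and integer division; all names are illustrative):

    #include <stdint.h>

    /* Decompose a byte address using the slide's formulas.
     * cache_size and block_size are in bytes. */
    void decompose(uint32_t addr, uint32_t cache_size, uint32_t block_size,
                   uint32_t *tag, uint32_t *index, uint32_t *offset)
    {
        *index  = (addr % cache_size) / block_size;  /* Cache-Index  */
        *offset =  addr % block_size;                /* Block-Offset */
        *tag    =  addr / cache_size;                /* Tag          */
    }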


Slide 13: The Simplest Cache: Direct Mapped Cache

[Figure: a 16-location memory (addresses 0 through F) mapping into a 4-byte direct mapped cache (indexes 0 through 3)]

• Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?

Slide 14: Direct Mapped Cache (Cont.)

For a cache of 2^M bytes with a block size of 2^L bytes:
– There are 2^(M-L) cache blocks
– The lowest L bits of the address are the Block-Offset bits
– The next (M - L) bits are the Cache-Index
– The last (32 - M) bits are the Tag bits

Address layout: | Tag (32-M bits) | Cache Index (M-L bits) | Block Offset (L bits) |
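Because the cache size is 2^M bytes and the block size is 2^L bytes, the same fields fall out of a 32-bit address with shifts and masks. A sketch under those assumptions (names are illustrative):

    #include <stdint.h>

    /* M = log2(cache size in bytes), L = log2(block size in bytes). */
    void decompose_bits(uint32_t addr, int M, int L,
                        uint32_t *tag, uint32_t *index, uint32_t *offset)
    {
        *offset = addr & ((1u << L) - 1);              /* lowest L bits   */
        *index  = (addr >> L) & ((1u << (M - L)) - 1); /* next M - L bits */
        *tag    = addr >> M;                           /* top 32 - M bits */
    }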


Slide 15: Example: 1-KB Cache with 32-B Blocks

Cache Index = (<Address> mod 1024) / 32
Block-Offset = <Address> mod 32
Tag = <Address> / 1024

(1 KB = 2^10 = 1024 bytes; 2^5 = 32 bytes)

[Figure: the data array holds 32 cache blocks of 32 bytes each (Byte 0 ... Byte 31 per block); each block has a valid bit and a 22-bit tag; the address splits into a 22-bit tag, a 5-bit cache index, and a 5-bit block offset]

Slide 16: Example: 1-KB Direct Mapped Cache with 32-B Blocks

• For a 1024 (2^10) byte cache with 32-byte blocks:
– The uppermost 22 = (32 - 10) address bits are the Cache Tag
– The lowest 5 address bits are the Byte Select (Block Size = 2^5)
– The next 5 address bits (bit 5 - bit 9) are the Cache Index

[Figure: address bits <31:10> form the cache tag (example: 0x50), bits <9:5> the cache index (example: 0x01), and bits <4:0> the byte select (example: 0x00); the tag is stored as part of the cache "state" alongside the valid bit, and the data array spans Byte 0 through Byte 1023]


Slide 17: Example: 1-KB Direct Mapped Cache, Access 1 (Miss)

[Figure: an access with cache tag 0x0002fe, cache index 0x00, and byte select 0x00; entry 0 is invalid (tag 0xxxxxxx), so the tag compare fails: cache miss]

Slide 18: Example: 1-KB Direct Mapped Cache, Fill After the Miss

[Figure: the missing block is fetched from memory; entry 0 now has its valid bit set, tag 0x0002fe, and a new block of data]


Slide 19: Example: 1-KB Direct Mapped Cache, Access 2 (Hit)

[Figure: an access with cache tag 0x000050, cache index 0x01, and byte select 0x08; entry 1 is valid and its stored tag 0x000050 matches: cache hit]

Slide 20: Example: 1-KB Direct Mapped Cache, Access 3 (Miss)

[Figure: an access with cache tag 0x002450, cache index 0x02, and byte select 0x04; entry 2 is valid but its stored tag 0x004440 does not match: cache miss]


Slide 21: Example: 1-KB Direct Mapped Cache, Fill After the Miss

[Figure: the old block at entry 2 is replaced; entry 2 now holds tag 0x002450 and a new block of data]

Slide 22: Block Size Tradeoff

• In general, a larger block size takes advantage of spatial locality, BUT:
– Larger block size means larger miss penalty:
» It takes longer to fill up the block
– If the block size is too big relative to the cache size, the miss rate will go up
» Too few cache blocks
• Average Memory Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

[Figure: three curves versus block size: miss penalty rises steadily; miss rate first falls as spatial locality is exploited, then rises when too few blocks compromise temporal locality; average access time therefore has a minimum before the increased miss penalty and miss rate take over]
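The average-access-time formula is a one-liner in code. A minimal sketch (units follow the inputs, e.g., ns or clock cycles):

    /* AMAT per the slide: Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate.
     * If miss_penalty instead counts only the extra delay beyond the hit time,
     * the equivalent form is hit_time + miss_rate * miss_penalty
     * (the form used in Example 2 later). */
    double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
    }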


Slide 23: An N-way Set Associative Cache

• N-way set associative: N entries for each cache index
– N direct mapped caches operating in parallel
• Example: two-way set associative cache
– The cache index selects a "set" from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result

[Figure: two-way set associative lookup: the cache index selects one block from each way, the address tag is compared against both stored tags in parallel, the compare results are ORed to form Hit, and a mux (SEL1/SEL0) selects the hitting way's cache block]
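A minimal sketch of the parallel lookup just described (the set count, block size, and all names are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NUM_SETS   16
    #define BLOCK_SIZE 32

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } Way;

    /* Two direct mapped caches operating in parallel. */
    static Way cache[NUM_SETS][2];

    /* Compare both tags in the selected set; return the hitting
     * block's data (the mux in the figure), or NULL on a miss. */
    uint8_t *lookup(uint32_t tag, uint32_t index)
    {
        for (int w = 0; w < 2; w++)
            if (cache[index][w].valid && cache[index][w].tag == tag)
                return cache[index][w].data;
        return NULL;
    }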

Slide 24: Advantages of a Set Associative Cache

• Higher hit rate for the same cache size
• Fewer conflict misses
• Can have a larger cache but keep the index smaller (same size as the virtual page index)


Slide 25: Disadvantage of a Set Associative Cache

• N-way set associative cache versus direct mapped cache:
– N comparators vs. 1 (power penalty)
– Extra MUX delay for the data (delay penalty)
– Data comes AFTER the hit/miss decision and set selection
• In a direct mapped cache, the cache block is available BEFORE hit/miss:
– Possible to assume a hit and continue; recover later if it is a miss

Slide 26: And Yet Another Extreme Example: Fully Associative Cache

• Fully associative cache: push the set associative idea to its limit!
– Forget about the cache index
– Compare the cache tags of all cache entries in parallel
– Example: with 32-B blocks (and 32-bit addresses), we need N 27-bit comparators
• By definition: conflict misses = 0 for a fully associative cache

[Figure: every entry's 27-bit tag is compared against the address tag in parallel, one comparator per entry; the byte select (example: 0x01) picks the byte within the hitting block]


Slide 27: Sources of Cache Misses

• Compulsory (cold start or process migration, first reference): first access to a block
– Would miss even in an infinite cache
– "Cold" fact of life: not a whole lot you can do about it
– Prefetching helps; a larger block gives implicit prefetch
• Capacity:
– Would miss even in a fully associative cache with LRU replacement
– The cache cannot contain all blocks accessed by the program
– Solution: increase cache size
• Conflict (collision):
– Hits in a fully associative cache, misses in a set associative cache
– Multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
• Invalidation: another process (e.g., I/O) updates memory

Slide 28: Sources of Cache Misses (Summary)

                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Conflict Miss       High            Medium                  Zero
Capacity Miss       Low(er)         Medium                  High
Invalidation Miss   Same            Same/lower              Same/lower

Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.


Slide 29: The Need to Make a Decision!

• Direct mapped cache:
– Each memory location can map to only 1 cache location (frame)
– No need to make any decision :-)
» The current item replaces the previous item in that cache location
• N-way set associative cache:
– Each memory location has a choice of N cache frames
• Fully associative cache:
– Each memory location can be placed in ANY cache frame
• Cache miss in an N-way set associative or fully associative cache:
– Bring in the new block from memory
– Throw out a cache block to make room for the new block
– We need to make a decision on which block to throw out!

Slide 30: Cache Block Replacement Policy

• Random replacement:
– Hardware randomly selects a cache block and throws it out
• Least Recently Used (LRU):
– Hardware keeps track of the access history
– Replace the entry that has not been used for the longest time
– A two-way set associative cache needs only one bit per set for LRU replacement
• Example of a simple "pseudo" LRU implementation (sketched in code below):
– Assume 64 fully associative entries
– A hardware replacement pointer points to one cache entry
– Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
– Otherwise: do not move the pointer

[Figure: a replacement pointer sweeping over Entry 0 through Entry 63]
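A minimal sketch of that pointer scheme (the 64-entry count is from the slide; function names are illustrative):

    #define NUM_ENTRIES 64

    static int replacement_ptr = 0;

    /* Called on every access to entry `idx`: if the pointer's entry
     * was just used, move the pointer off it; otherwise leave it. */
    void on_access(int idx)
    {
        if (idx == replacement_ptr)
            replacement_ptr = (replacement_ptr + 1) % NUM_ENTRIES;
    }

    /* Called on a miss: the victim is whatever the pointer points to,
     * which is never the most recently accessed entry. */
    int pick_victim(void)
    {
        return replacement_ptr;
    }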


Slide 31: Cache Write Policy: Write Through versus Write Back

• Cache reads are much easier to handle than cache writes:
– An instruction cache is much easier to design than a data cache
• Cache writes:
– How do we keep data in the cache and memory consistent?
• Two options (decision time again :-):
– Write Back: write to the cache only. Write the cache block to memory when that cache block is being replaced on a cache miss.
» Need a "dirty bit" for each cache block
» Greatly reduces the memory bandwidth requirement
» Can buffer the data in a write-back buffer
» Control can be complex
– Write Through: write to the cache and memory at the same time.
» What!!! How can this be? Isn't memory too slow for this?

Slide 32: Write Buffer for Write Through

• A write buffer is needed between the cache and memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO (a sketch follows below):
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
– Store frequency (w.r.t. time) > 1 / DRAM write cycle
– Write buffer saturation

[Figure: the processor writes into both the cache and the write buffer; the write buffer drains to DRAM]
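A minimal sketch of such a FIFO write buffer (4 entries per the slide; names and types are illustrative, not a real controller interface):

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4          /* typical depth, per the slide */

    typedef struct { uint32_t addr; uint32_t data; } WbEntry;

    static WbEntry wb[WB_ENTRIES];
    static int head = 0, count = 0;

    /* Processor side: enqueue a store. Returns false (processor must
     * stall) when the buffer is full: write buffer saturation. */
    bool wb_enqueue(uint32_t addr, uint32_t data)
    {
        if (count == WB_ENTRIES) return false;
        wb[(head + count) % WB_ENTRIES] = (WbEntry){ addr, data };
        count++;
        return true;
    }

    /* Memory-controller side: drain one entry per DRAM write cycle. */
    bool wb_drain(WbEntry *out)
    {
        if (count == 0) return false;
        *out = wb[head];
        head = (head + 1) % WB_ENTRIES;
        count--;
        return true;
    }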


Slide 33: Write Buffer Saturation

• Store frequency (w.r.t. time) approaches 1 / DRAM write cycle
– If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
» The store buffer will overflow no matter how big you make it
» The CPU cycle time << DRAM write cycle time
• Solutions for write buffer saturation:
– Use a write back cache
– Install a second-level (L2) cache

[Figure: processor and cache feeding a write buffer into DRAM, with and without an intervening L2 cache]

Slide 34: Write Policies

• We know about write-through vs. write-back
• Assume: a 16-bit write to memory location 0x00 causes a cache miss
• Do we change the cache tag and update data in the block?
– Yes: Write Allocate
– No: Write No-Allocate
• Do we fetch the other data in the block?
– Yes: Fetch-on-Write (usually paired with write-allocate)
– No: No-Fetch-on-Write
• Write-around cache
– Write-through with no-write-allocate


Slide 35: Sub-block Cache (Sectored)

• Sub-block:
– Share one cache tag between all sub-blocks in a block
– Each sub-block within a block has its own valid bit
– Example: 1-KB direct mapped cache, 32-B blocks, 8-B sub-blocks
» Each cache entry will have 32/8 = 4 valid bits
• Miss: only the bytes in that sub-block are brought in
– Reduces cache fill bandwidth (penalty)

[Figure: each cache entry has one tag and four sub-blocks (Sub-block0 through Sub-block3), each with its own valid bit]
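A sketch of one sectored cache entry matching the slide's example of a 32-B block with four 8-B sub-blocks (names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS_PER_BLOCK 4   /* 32 B / 8 B, per the slide */
    #define SUBBLOCK_BYTES      8

    typedef struct {
        uint32_t tag;                        /* one tag for the whole block */
        bool     valid[SUBBLOCKS_PER_BLOCK]; /* one valid bit per sub-block */
        uint8_t  data[SUBBLOCKS_PER_BLOCK][SUBBLOCK_BYTES];
    } SectoredEntry;

    /* A hit needs both a tag match and the target sub-block's valid bit. */
    bool subblock_hit(const SectoredEntry *e, uint32_t tag, int sb)
    {
        return e->tag == tag && e->valid[sb];
    }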

Slide 36: Review: Four Questions for Memory Hierarchy Designers

• Q1: Where can a block be placed in the upper level? (Block placement)
– Fully associative, set associative, direct mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
– Tag/block
• Q3: Which block should be replaced on a miss? (Block replacement)
– Random, LRU
• Q4: What happens on a write? (Write strategy)
– Write back or write through (with a write buffer)


Slide 37: Cache Performance

CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time

Memory stall clock cycles =
  Reads x Read miss rate x Read miss penalty
  + Writes x Write miss rate x Write miss penalty

Combining reads and writes:

Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

Slide 38: Cache Performance (Cont.)

CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

(cache hits are included in CPI_execution)

Misses per instruction = Memory accesses per instruction x Miss rate

CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time


Slide 39: Example

• Miss penalty: 50 clocks
• Miss rate: 2%
• Base CPI: 2.0
• 1.33 memory references per instruction
• Compute the CPU time (a quick check in code follows)

• CPU time = IC x (2.0 + (1.33 x 0.02 x 50)) x Clock cycle time
• CPU time = IC x 3.33 x Clock cycle time
• So the CPI increased from 2.0 to 3.33 with a 2% miss rate
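The same arithmetic as a runnable check (all numbers are from the slide):

    #include <stdio.h>

    int main(void)
    {
        double base_cpi      = 2.0;
        double refs_per_inst = 1.33;
        double miss_rate     = 0.02;
        double miss_penalty  = 50.0;   /* clock cycles */

        /* CPI_exec = CPI_base + refs/inst x miss rate x miss penalty */
        double cpi = base_cpi + refs_per_inst * miss_rate * miss_penalty;
        printf("effective CPI = %.2f\n", cpi);   /* prints 3.33 */
        return 0;
    }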

Slide 40: Example 2

• Two caches: both 64 KB, 32-byte blocks, 70-ns miss penalty, 1.3 references per instruction, CPI 2.0 with a perfect cache
• Direct mapped:
– Cycle time 2 ns
– Miss rate 1.4%
• 2-way set associative:
– Cycle time increases by 10%
– Miss rate 1.0%
• Which is better?
– Compute average memory access time
– Compute CPU time


Slide 41: Example 2 Continued

• Average memory access time = Hit time + (Miss rate x Miss penalty)
– 1-way: 2.0 + (0.014 x 70) = 2.98 ns
– 2-way: 2.2 + (0.010 x 70) = 2.90 ns
• CPU time = IC x CPI_exec x Cycle time
– CPI_exec = CPI_base + (Memory accesses per instruction x Miss rate x Miss penalty)
– Note: miss penalty (in cycles) x cycle time = 70 ns, so the stall term can be computed directly in ns
– 1-way: IC x ((2.0 x 2.0) + (1.3 x 0.014 x 70)) = 5.27 x IC ns
– 2-way: IC x ((2.0 x 2.2) + (1.3 x 0.010 x 70)) = 5.31 x IC ns
• So the 2-way cache wins on average memory access time, but the direct mapped cache wins on CPU time: the slower clock taxes every instruction, not just the misses
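A runnable check of both computations (numbers from the slide):

    #include <stdio.h>

    int main(void)
    {
        double penalty = 70.0;   /* ns */
        double refs    = 1.3;    /* memory references per instruction */
        double cpi     = 2.0;    /* CPI with a perfect cache */

        /* Direct mapped: 2.0-ns cycle, 1.4% miss rate. */
        double amat1 = 2.0 + 0.014 * penalty;               /* 2.98 ns */
        double cpu1  = cpi * 2.0 + refs * 0.014 * penalty;  /* 5.27 ns/inst */

        /* Two-way: cycle 10% slower (2.2 ns), 1.0% miss rate. */
        double amat2 = 2.2 + 0.010 * penalty;               /* 2.90 ns */
        double cpu2  = cpi * 2.2 + refs * 0.010 * penalty;  /* 5.31 ns/inst */

        printf("AMAT: %.2f ns (1-way) vs %.2f ns (2-way)\n", amat1, amat2);
        printf("CPU time: %.2f vs %.2f ns per instruction\n", cpu1, cpu2);
        return 0;
    }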