CMP 301A Computer Architecture 1 Lecture 2


Page 1:

CMP 301A Computer Architecture 1

Lecture 2

Page 2:

Outline

• Direct mapped caches: Reading and writing policies
• Measuring cache performance
• Improving cache performance
• Enhancing main memory performance
• Flexible placement of blocks: Associativity
• Multilevel caches

Page 3:

Read and Write Policies

• A cache read is much easier to handle than a cache write: an instruction cache is much easier to design than a data cache.

• Cache write: how do we keep the data in the cache and in memory consistent?

• Two write options:
  - Write Through: write to the cache and to memory at the same time. Isn't memory too slow for this?
  - Write Back: write to the cache only; write the cache block back to memory when that block is replaced on a cache miss.
    • Needs a "dirty" bit for each cache block
    • Greatly reduces the memory bandwidth requirement
    • Control can be complex
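The trade-off between the two policies can be sketched in a few lines of Python. This is an illustration, not code from the lecture: the class names and the `memory_writes` counter are invented for the example.

```python
# Minimal direct-mapped cache sketch supporting both write policies.
class CacheLine:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.dirty = False   # only meaningful under write-back
        self.data = 0

class Cache:
    def __init__(self, num_lines, policy="write-back"):
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.policy = policy
        self.memory = {}          # stand-in for DRAM
        self.memory_writes = 0    # counts traffic to memory

    def write(self, addr, value):
        index = addr % len(self.lines)
        tag = addr // len(self.lines)
        line = self.lines[index]
        if line.valid and line.dirty and line.tag != tag:
            # write-back: the evicted dirty block goes to memory first
            self.memory[line.tag * len(self.lines) + index] = line.data
            self.memory_writes += 1
        line.valid, line.tag, line.data = True, tag, value
        if self.policy == "write-through":
            self.memory[addr] = value   # cache and memory updated together
            self.memory_writes += 1
        else:
            line.dirty = True           # defer the memory write to eviction
```

Ten stores to the same address cost ten memory writes under write-through, but only one deferred write-back (at eviction) under write-back, which is the bandwidth reduction the slide refers to.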

Page 4:

Write Buffer for Write Through

• A write buffer is needed between the cache and memory:
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes the contents of the buffer to memory

• The write buffer is just a FIFO:
  - Typical number of entries: 4
  - Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

• Memory system designer's nightmare:
  - Store frequency (w.r.t. time) → 1 / DRAM write cycle
  - Write buffer saturation

• Problem: the write buffer may hold the updated value of a location needed by a read miss!

[Figure: Processor → Cache → Write Buffer → DRAM]
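A minimal sketch of the FIFO write buffer and the read-miss problem just described (names are my own; real hardware probes the buffer with comparators, not a loop):

```python
from collections import deque

class WriteBuffer:
    """4-entry FIFO write buffer between the cache and DRAM."""
    def __init__(self, entries=4):
        self.fifo = deque()
        self.entries = entries
        self.dram = {}

    def store(self, addr, value):
        if len(self.fifo) == self.entries:
            self.drain_one()           # buffer saturated: processor stalls
        self.fifo.append((addr, value))

    def drain_one(self):
        addr, value = self.fifo.popleft()
        self.dram[addr] = value        # memory controller retires one entry

    def read_miss(self, addr):
        # A read miss must check the buffer before going to DRAM,
        # newest-to-oldest so the latest pending store wins.
        for a, v in reversed(self.fifo):
            if a == addr:
                return v
        return self.dram.get(addr, 0)
```

Without the buffer check in `read_miss`, a read miss could fetch a stale value from DRAM while the up-to-date value is still waiting in the FIFO.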

Page 5:

Write Allocate versus Not Allocate

• Assume a 16-bit write to memory location 0x0 causes a miss.
• Do we read in the rest of the block (bytes 2, 3, ..., 31)?
  - Yes: Write Allocate
  - No: Write Not Allocate

[Figure: a direct-mapped cache with 32-byte blocks. The address is split into Cache Tag (e.g. 0x00), Cache Index (e.g. 0x00), and Byte Select, with field boundaries at bits 31, 9, 4, and 0. Each entry holds a valid bit, a cache tag, and 32 bytes of data (Byte 0 ... Byte 31, up to Byte 1023 across the cache).]
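The two allocation policies can be sketched as follows. This is my own illustration (the function name and write-back assumption on the allocate path are not from the slides):

```python
BLOCK_SIZE = 32  # bytes per cache block, as in the slide's example

def handle_write_miss(cache, memory, addr, data, policy="allocate"):
    """cache: dict mapping block address -> bytearray; memory: bytearray."""
    block_addr = addr - (addr % BLOCK_SIZE)
    if policy == "allocate":
        # Write Allocate: read the whole block into the cache, then
        # merge the new bytes (memory is updated later, on write-back).
        block = memory[block_addr:block_addr + BLOCK_SIZE]
        cache[block_addr] = bytearray(block)
        offset = addr % BLOCK_SIZE
        cache[block_addr][offset:offset + len(data)] = data
    else:
        # Write Not Allocate: update memory directly, bypassing the cache.
        memory[addr:addr + len(data)] = data
```

Write allocate pays for a block fetch up front but makes subsequent accesses to the same block hit; write not allocate avoids the fetch but leaves the block uncached.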

Page 6:

Measuring cache performance: Impact of cache misses on performance

• Suppose a processor executes at clock rate = 1 GHz (1 ns per cycle), with an ideal (no misses) CPI of 1.1 and an instruction mix of 50% arith/logic, 30% ld/st, 20% control.

• Suppose that 10% of memory operations (involving data) incur a 100-cycle miss penalty.

• Suppose that 1% of instructions incur the same miss penalty.

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/instr)
      + 0.30 (Data_Mops/instr) × 0.10 (miss/Data_Mop) × 100 (cycles/miss)
      + 1 (Inst_Mop/instr) × 0.01 (miss/Inst_Mop) × 100 (cycles/miss)
    = (1.1 + 3.0 + 1.0) cycles/instr = 5.1 cycles/instr

78% of the time the processor is stalled waiting for memory! (4.0 / 5.1 ≈ 78%)
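The arithmetic above can be checked directly (all values taken from the slide's example):

```python
ideal_cpi = 1.1
data_refs_per_instr = 0.30     # 30% of instructions are loads/stores
data_miss_rate = 0.10          # 10% of data memory ops miss
instr_miss_rate = 0.01         # 1% of instruction fetches miss
miss_penalty = 100             # cycles

data_stalls = data_refs_per_instr * data_miss_rate * miss_penalty   # 3.0
instr_stalls = 1 * instr_miss_rate * miss_penalty                   # 1.0
cpi = ideal_cpi + data_stalls + instr_stalls                        # 5.1
stall_fraction = (data_stalls + instr_stalls) / cpi                 # ~0.78
print(cpi, round(stall_fraction, 2))
```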

Page 7:

Improving Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty

To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
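A quick worked example of the AMAT formula (the numbers here are my own, not from the slides):

```python
hit_time = 1        # cycles
miss_rate = 0.05    # 5% of accesses miss
miss_penalty = 100  # cycles

amat = hit_time + miss_rate * miss_penalty
print(amat)         # average cycles per memory access
```

Halving the miss rate to 0.025 brings AMAT from 6.0 down to 3.5 cycles, which is why the three levers above are listed separately: each term of the formula can be attacked independently.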

Page 8:

Enhancing main memory performance

• Increasing memory and bus width: transfer more words every clock cycle. Isn't that too much wiring?

• Using an interleaved memory organization: reduce access time with less wiring.

• Double Data Rate (DDR) DRAMs
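Interleaving maps consecutive words to different banks so that sequential accesses can overlap in time. A sketch of the low-order interleaved address mapping (the constants are illustrative):

```python
NUM_BANKS = 4
WORD_SIZE = 4  # bytes

def bank_and_offset(addr):
    """Low-order interleaving: word address modulo the bank count
    picks the bank, so consecutive words land in different banks."""
    word = addr // WORD_SIZE
    return word % NUM_BANKS, word // NUM_BANKS  # (bank, row within bank)

# Sequential word addresses rotate through banks 0, 1, 2, 3, 0, ...
for addr in range(0, 32, WORD_SIZE):
    print(addr, bank_and_offset(addr))
```

While bank 0 is busy completing its access, banks 1-3 can start theirs, which is how interleaving reduces effective access time without widening the bus.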

Page 9:

Enhancing main memory performance (cont.)

Page 10:

Flexible placement of blocks: Associativity

[Figure: where memory block 12 can be placed in an 8-block cache.
• Fully associative: anywhere
• 2-way set associative: anywhere in set 0 (12 mod 4)
• Direct mapped: only into block 4 (12 mod 8)]
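The three placement rules are really one rule with a varying set size, which a few lines of Python make explicit (my own sketch of the figure's example):

```python
NUM_BLOCKS = 8  # cache size in blocks, as in the figure

def candidate_blocks(block_number, ways):
    """Cache blocks where a memory block may be placed when the cache
    is organized as sets of `ways` blocks each.
    ways=1: direct mapped; ways=NUM_BLOCKS: fully associative."""
    num_sets = NUM_BLOCKS // ways
    s = block_number % num_sets          # the block's set
    return list(range(s * ways, s * ways + ways))

print(candidate_blocks(12, 1))   # direct mapped: only block 12 mod 8 = 4
print(candidate_blocks(12, 2))   # 2-way: set 12 mod 4 = 0, blocks 0 and 1
print(candidate_blocks(12, 8))   # fully associative: anywhere
```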

Page 11:

Flexible placement of blocks: Associativity

Page 12:

A Two-way Set Associative Cache

• N-way set associative: N entries for each cache index; N direct-mapped caches operate in parallel.

• Example: a two-way set associative cache:
  - The cache index selects a "set" from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag comparison result

[Figure: a two-way set associative cache. The cache index selects one entry (valid bit, cache tag, cache data block) from each of the two ways. The address tag (Adr Tag) is compared against both stored tags in parallel; the comparator outputs are ORed to produce Hit and drive the mux select signals (Sel1, Sel0) that choose the matching cache block.]
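The lookup path in the figure can be sketched as follows. This is illustrative: hardware probes both ways with parallel comparators, while this sketch simply checks each way in turn.

```python
NUM_SETS = 4  # sets per way (illustrative size)

def lookup(ways, addr):
    """ways: list of two dicts mapping set index -> (valid, tag, data)."""
    index = addr % NUM_SETS       # cache index selects the set
    tag = addr // NUM_SETS        # address tag to compare
    for way in ways:              # the two comparators of the figure
        valid, stored_tag, data = way.get(index, (False, None, None))
        if valid and stored_tag == tag:
            return True, data     # hit: mux selects this way's block
    return False, None            # miss: OR of comparators is false

ways = [dict(), dict()]
ways[0][2] = (True, 5, "A")   # caches the block at address 5*4 + 2 = 22
ways[1][2] = (True, 7, "B")   # same index 2, different tag: address 30
print(lookup(ways, 22))       # hit in way 0
print(lookup(ways, 30))       # hit in way 1
print(lookup(ways, 6))        # index 2, tag 1: miss in both ways
```

Note that addresses 22 and 30 map to the same index; a direct-mapped cache would have had to evict one of them, while the two-way cache holds both.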

Page 13:

And Yet Another Extreme Example: Fully Associative

• Fully associative cache: push the set-associative idea to its limit!
  - Forget about the cache index
  - Compare the cache tags of all cache entries in parallel
  - Example: with 32-byte blocks, we need N 27-bit comparators

• By definition: conflict misses = 0 for a fully associative cache

[Figure: a fully associative cache with 32-byte blocks. The address is split into Cache Tag (27 bits, bits 31-5) and Byte Select (bits 4-0, e.g. 0x01). Every entry's valid bit and 27-bit tag are compared against the address tag simultaneously (one comparator per entry, marked X in the figure).]

Page 14:

Replacement Policy

In an associative cache, which block from a set should be evicted when the set becomes full?

• Random
• Least-Recently Used (LRU)
  - LRU cache state must be updated on every access
  - A true implementation is only feasible for small sets (2-way)
• First-In, First-Out (FIFO), a.k.a. Round-Robin
  - Used in highly associative caches
• Not-Most-Recently Used (NMRU)
  - FIFO with an exception for the most-recently used block or blocks

Replacement only happens on misses.
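LRU for a single set can be sketched with an ordered dictionary acting as the recency list (my own illustration; hardware tracks recency with per-set state bits, not a dictionary):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an associative cache with LRU replacement.
    The OrderedDict keeps blocks ordered least- to most-recently used."""
    def __init__(self, ways=2):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data

    def access(self, tag, data=None):
        if tag in self.blocks:                 # hit: update LRU state
            self.blocks.move_to_end(tag)       # (state changes on EVERY access)
            return True
        if len(self.blocks) == self.ways:      # miss on a full set:
            self.blocks.popitem(last=False)    # evict the least-recently used
        self.blocks[tag] = data                # replacement happens on misses
        return False

s = LRUSet(ways=2)
s.access(1); s.access(2)
s.access(1)            # hit: tag 1 becomes most recently used
s.access(3)            # miss: evicts tag 2, the LRU block
print(list(s.blocks))  # [1, 3]
```

The `move_to_end` on every hit is exactly the bookkeeping cost the slide mentions, and it is why true LRU is only practical for small sets.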