CSCI 232 © 2005 JW Ryder 1
Cache Memory Systems
• Introduced by M.V. Wilkes (“Slave Store”)
• First appeared commercially in the IBM S360/85
Motivations
• Main memory access time is 5 to 25 times slower than accessing a register
– On-chip vs. off-chip issues, among others
• Can’t have too many registers in the CPU
• Program locality should allow small fast buffer between the CPU and MM
• Should be managed by hardware to be effective
Motivations Continued
• Most of the time, the requested MM data must be found in the cache for the cache to be worthwhile
• That can only happen if dynamic locality is tracked well
• Automatic management, transparent to Instruction Set Architecture (ISA)
Access and Cost
• Tcache < TMM
• Treg < Tcache
• Creg > Ccache > CMM (cost per bit; reflects chip real estate)
Cache vs. Registers
• Cache
– Locality: tracked dynamically
– Management: hardware
– Expandability: easy
– ISA Visibility: invisible (mostly)
• Registers
– Locality: static, by compiler
– Management: software/programmer
– Expandability: not possible
– ISA Visibility: visible
[Figure: Simple Cache Based System — CPU and Registers connected to Cache and MM; data paths numbered 1–5 correspond to the read-operation steps]
Read Operation
• See if the desired MM word is in the cache (1)
• If it is (cache hit), get it from the cache (2)
• If it isn't (cache miss), get it from MM, supplying it simultaneously to the CPU and the cache (3)
– Make room in the cache by selecting a victim, which may have to be written back to MM (4); the copy is then installed (5)
• The CPU stalls until the missing word is supplied
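The numbered steps can be sketched as a toy simulation. This is a hypothetical direct-mapped cache (all names are illustrative, not from the slides); write-back of a dirty victim (step 4) is omitted since the sketch is read-only.

```python
# Toy direct-mapped cache modeling the read steps: probe (1), hit (2),
# miss fetch from MM (3), victim eviction (4; dirty write-back omitted
# in this read-only sketch), install (5).

class ToyCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}                        # index -> (tag, data)

    def read(self, addr, mm):
        index, tag = addr % self.num_lines, addr // self.num_lines
        entry = self.lines.get(index)
        if entry and entry[0] == tag:          # (1) probe, (2) hit
            return entry[1], "hit"
        data = mm[addr]                        # (3) fetch from MM on a miss
        self.lines[index] = (tag, data)        # (4) evict old entry, (5) install
        return data, "miss"

mm = {a: a * 10 for a in range(32)}            # toy main memory
cache = ToyCache(4)
print(cache.read(5, mm))   # (50, 'miss')
print(cache.read(5, mm))   # (50, 'hit')
```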
Locality of Reference
• Temporal
– If this word is needed now, there is a good chance it will be needed again
• Spatial
– A fetch from MM actually brings in a chunk of words
– Some word near the requested word will probably also be needed
• Registers exploit temporal locality (TLOR)
• Caches exploit both temporal (TLOR) and spatial (SLOR) locality
Selecting a Victim
• The victim should be a block that will not be accessed in the near future
• Maintain a history of usage to guide the choice
• The basic unit of transfer between cache and MM is a block (line) of 2^b words
– b is small (2 to 4)
• On a miss, the block containing the missing word is loaded into the cache (by the cache controller)
• This ensures neighboring words are also cached (SLOR)
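One common way to maintain the usage history is least-recently-used (LRU) order. A minimal sketch for a fully associative cache (the slides don't name a specific policy; LRU and the names below are illustrative):

```python
from collections import OrderedDict

# LRU victim selection: the OrderedDict's ordering *is* the usage
# history; the front entry is the least recently used block.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()            # block_addr -> data

    def access(self, block_addr, fetch):
        if block_addr in self.blocks:
            self.blocks.move_to_end(block_addr)  # mark most recently used
            return self.blocks[block_addr], None
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)  # evict LRU block
        self.blocks[block_addr] = fetch(block_addr)
        return self.blocks[block_addr], victim

cache = LRUCache(2)
fetch = lambda b: b * 100                      # stand-in for an MM block read
cache.access(1, fetch)
cache.access(2, fetch)
cache.access(1, fetch)                         # touch 1, so 2 becomes LRU
_, victim = cache.access(3, fetch)
print(victim)   # 2
```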
Addressing Cache
• Addressed the same as memory
• The cache stores entries in the form
– <block address, contents of words in block, usage info>
• The cache controller compares the address issued by the CPU with the address field of the cache entries to determine a hit or miss
• Transfers between cache and CPU are only a word or two
• Transfers between cache and MM are in blocks
• Hit: data comes back from the cache in 1 clock cycle
• Miss: 15 to 20 cycles
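The comparison against the stored block-address field relies on splitting each CPU address into a block address and a word offset. A minimal sketch (choosing b = 2, i.e. 4-word blocks, as an illustrative value from the "b is small" range):

```python
B = 2                                   # b: log2(words per block); 2**B = 4 words

def split_address(addr, b=B):
    block_addr = addr >> b              # compared against stored address fields
    offset = addr & ((1 << b) - 1)      # selects the word within the block
    return block_addr, offset

print(split_address(13))   # (3, 1): word 13 is word 1 of block 3
```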
Functions of Cache Controller
• Given an address issued by the CPU, the CC should be able to determine whether the block containing that word is in the cache
– Requires associative logic / comparators
• The CC needs to keep track of the usage of blocks in the cache
• Hardware logic for victim selection
• May need to write back a line (victim) from the cache to MM
• Must implement a placement policy that determines how blocks from MM are placed in the cache
• A replacement policy is needed only if there is a choice of victim
Cache Loading Strategies
• Load a block into the cache from MM only on a miss
• Prefetch (anticipating a miss) a block into the cache
– Prefetch on Miss: on a miss to block i, prefetch block i + 1 too
– Always Prefetch: prefetch block i + 1 on the first reference to block i
– Tagged Prefetch: prefetch on a miss, and also prefetch block i + 1 when a previously prefetched block is referenced for the first time
– Keep prefetching as long as the last prefetch was useful
– Tags distinguish not-yet-accessed blocks from the others
More Strategies
• The previous prefetches fetch 1 block; prefetches can be > 1 block
• Selective Fetch
– Don't fetch shared writeable blocks
– Used in many systems to avoid cache incoherence (multiprocessors)
Load-Thru / Read-Thru
• Missing word forwarded to CPU and cache concurrently
• Remaining words of block are then fetched in wraparound fashion
[Figure: word positions 0, 1, 2, …, 2^k within a block, with a pointer at the missed word w]
• This is the order of loading for the remaining words in the block
• Wrapping around saves resetting the pointer
• The write pointer is already positioned
• Not needed if the block can be loaded in one shot
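The wraparound fill order can be stated in one line. A small sketch (the function name is illustrative):

```python
def wraparound_order(block_size, missed_offset):
    """Fill order: the missed word first, then wrap around the block."""
    return [(missed_offset + i) % block_size for i in range(block_size)]

# Word 5 missed in an 8-word block: forward it first, then wrap.
print(wraparound_order(8, 5))   # [5, 6, 7, 0, 1, 2, 3, 4]
```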
Cache with Writeback Buffers
[Figure: CPU connected to Cache (reads R and writes W); Cache connected to MM through writeback buffers (writes W), with paths labeled for Write-Thru caches, Write-Back caches, and a Special case]
• Writeback buffer = fast registers
• Special: used with both types of caches; arises when a word has been written to the writeback buffer and a cache miss then occurs
• Three speeds are involved: cache speed, buffer speed, memory speed
Write-Thru Caches
• A write generated by the CPU writes into the cache and also deposits the write into the writeback buffer
– Eventually written back to MM
• Delay perceived by CPU: max(Tcache, TWB)
• Tcache: cache access time
• TWB: time to write into the writeback buffer
• Tcache, TWB < TMM
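The max(Tcache, TWB) delay falls out of the two updates proceeding in parallel. A tiny sketch (the cycle counts are illustrative assumptions, not from the slides):

```python
T_CACHE, T_WB, T_MM = 1, 2, 20     # illustrative cycle counts (assumed)

def write_thru_delay(t_cache=T_CACHE, t_wb=T_WB):
    # The cache update and the writeback-buffer deposit happen in
    # parallel; the CPU waits for the slower of the two, never for MM.
    return max(t_cache, t_wb)

print(write_thru_delay())   # 2, far less than T_MM
```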
Writeback Cache
• Write to the cache
• Write modified victims to MM via the writeback buffer
• Delay perceived by CPU = Tcache
• The Special case happens on a miss, whether read or write
Cache Update Policies
• Keeps the MM copy and the cache copy of a word (and hence its block) consistent
• Write-Thru (Store-Thru)
– On a write hit, the copies in MM and cache are both updated simultaneously
– No need to write back blocks selected as victims
– Useful for multiprocessing systems (MM always has the latest copy)
– If the cache fails, the MM copy can serve as a hot backup
– Can slow the CPU on writes (since MM updates take place at slower rates)
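A write-thru hit in miniature, showing why victims never need writing back (a hypothetical sketch; the dict-based cache line is illustrative):

```python
# Write-thru sketch: a write hit updates the cache copy and the MM copy
# together, so the MM copy is never stale and victims are always clean.
def write_thru_hit(cache_line, mm, addr, value):
    cache_line[addr] = value
    mm[addr] = value                   # MM always has the latest copy

cache_line, mm = {3: 0}, {3: 0}
write_thru_hit(cache_line, mm, 3, 9)
print(cache_line[3], mm[3])   # 9 9
```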
Write-Back (No Write-Thru)
• On a write hit, only the cache copy is updated
• Faster writes on a cache hit
• Need to write back dirty blocks selected as victims
– Dirty block: a block modified after being brought into the cache
• Requires a clean/dirty bit for every block
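The dirty bit's role can be shown in a few lines (a hypothetical sketch; the class and function names are illustrative):

```python
# Write-back sketch: a write hit updates only the cache copy and sets
# the dirty bit; a dirty victim is written to MM before eviction.
class WriteBackLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

def write_hit(line, value):
    line.data = value
    line.dirty = True                  # MM copy is now stale

def evict(line, mm):
    if line.dirty:                     # clean victims need no write-back
        mm[line.tag] = line.data
        line.dirty = False

mm = {7: 0}
line = WriteBackLine(7, 0)
write_hit(line, 42)                    # fast: no MM traffic yet
evict(line, mm)                        # deferred write-back happens here
print(mm[7])   # 42
```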
Allocation Policies
• WTWA (Write-Thru Write Allocate): allocate the missing block in the cache on both a read miss and a write miss
• WTNWA (Write-Thru No Write Allocate): don't allocate on a write miss; allocate only on a read miss
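The two policies differ only in whether a write miss allocates a block, which a small decision function captures (a hypothetical sketch; names are illustrative):

```python
def allocates_on_miss(kind, policy):
    """Whether a miss of the given kind allocates a cache block."""
    if policy == "WTWA":
        return True                    # allocate on read and write misses
    if policy == "WTNWA":
        return kind == "read"          # allocate only on a read miss
    raise ValueError(f"unknown policy: {policy}")

print(allocates_on_miss("write", "WTWA"))    # True
print(allocates_on_miss("write", "WTNWA"))   # False
```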