SOFTENG 363 Computer Architecture Cache John Morris ECE/CS, The University of Auckland Iolanthe at 13 knots on Cockburn Sound, WA


SOFTENG 363

Computer Architecture

Cache

John Morris

ECE/CS, The University of Auckland

Iolanthe at 13 knots on Cockburn Sound, WA

Cache

• Small, fast memory
  • Typically ~50 kbytes (1998)
  • 2-cycle access time
  • Same die as processor
  • “Off-chip” cache possible
    • Custom cache chip closely coupled to processor
    • Use fast static RAM (SRAM) rather than slower dynamic RAM
  • Several levels possible
• 2nd level of the memory hierarchy
• “Caches” most recently used memory locations “closer” to the processor
  • closer = closer in time

Cache

• Etymology
  • cacher (French) = “to hide”
• Transparent to a program
  • Programs simply run slower without it
• Modern processors rely on it
  • Reduces the cost of main memory access
  • Enables instruction/cycle throughput
  • Typical program
    • ~25% memory accesses

Cache

• Relies upon locality of reference
  • Programs continually use - and re-use - the same locations
  • Instructions
    • loops
    • common subroutines
  • Data
    • look-up tables
    • “working” data sets

Cache - operation

• Memory requests checked in cache first
  • If the word sought is in the cache, it’s read from cache (or updated in cache)
    Cache hit
  • If not, request is passed to main memory and data is read (written) there
    Cache miss

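[Figure: CPU, MMU, Cache and Main Mem connected in sequence. The CPU issues a virtual address (VA); the MMU passes a physical address (PA) to the cache and on to main memory. Data or instructions (D or I) are returned.]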

Cache - operation

• Hit rates of 95% are usual
  • Cache: 16 kbytes
• Effective Memory Access Time
  • Cache: 2 cycles
  • Main memory: 10 cycles
  • Average access: 0.95*2 + 0.05*10 = 2.4 cycles
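The average above is a probability-weighted sum of the hit and miss costs. A minimal sketch of that arithmetic follows; the function name `effective_access_time` is illustrative only, and the model is the slide's (a miss costs the main-memory time, with no extra term for the failed cache probe).

```python
def effective_access_time(hit_rate, cache_cycles, memory_cycles):
    """Average cycles per access, weighting hit and miss costs by probability."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * cache_cycles + miss_rate * memory_cycles


# Figures from the slide: 95% hit rate, 2-cycle cache, 10-cycle main memory.
average = effective_access_time(0.95, 2, 10)
print(f"{average:.1f} cycles")
```

Changing `hit_rate` shows how quickly the average degrades as the hit rate falls.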

Cache - organisation

• Direct-mapped cache
  • Each word in the cache has a tag
  • Assume
    • cache size - 2^k words
    • machine words - p bits
    • byte-addressed memory
      • m = log2 ( p/8 ) bits not used to address words
      • m = 2 for 32-bit machines

Address format (p bits)

  tag (p-k-m bits) | cache address (k bits) | byte address (m bits)
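As an illustration of the address format, the sketch below splits an address into its tag, cache address and byte address fields with shifts and masks. The name `split_address` and the example address are assumptions made for this example; k and m follow the slide's definitions (a cache of 2^k words, m = log2(p/8)).

```python
def split_address(address, k, m):
    """Split a byte address into (tag, cache address, byte address).

    k: log2 of the number of cache words; m: log2 of the bytes per word.
    """
    byte_address = address & ((1 << m) - 1)          # lowest m bits
    cache_address = (address >> m) & ((1 << k) - 1)  # next k bits
    tag = address >> (k + m)                         # remaining p-k-m bits
    return tag, cache_address, byte_address


p = 32                           # 32-bit machine
m = (p // 8).bit_length() - 1    # log2(p/8) = 2
tag, cache_address, byte_address = split_address(0x12345678, k=14, m=m)
print(hex(tag), hex(cache_address), byte_address)
```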

Cache - organisation

• Direct-mapped cache

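[Figure: direct-mapped cache. The p-bit memory address from the CPU is split into tag (p-k-m bits), cache address (k bits) and byte address (m bits). The cache address selects one of 2^k lines. A cache line holds a tag (p-k-m bits) and data (p bits). The stored tag is compared with the address tag to give "Hit?"; otherwise the request goes to memory.]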

Cache - Direct Mapped

• Conflicts
  • Two addresses separated by 2^(k+m) bytes will hit the same cache location
  • 32-bit machine, 64 kbyte (16 kword) cache
    m = 2, k = 14
    Any program or data set larger than 64 kbytes will generate conflicts
  • On a conflict, the ‘old’ word is flushed
    • Unmodified word ( Program, constant data ) overwritten by the new data from memory
    • Modified data needs to be written back to memory before being overwritten
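To see the conflict in action, here is a small sketch of a direct-mapped cache that stores only tags and counts hits and misses. The class `DirectMappedCache` and the test addresses are hypothetical; it models one word per line, as on the slides, and ignores data and write-back.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache with 2**k one-word lines."""

    def __init__(self, k, m):
        self.k, self.m = k, m
        self.tags = [None] * (1 << k)   # None marks an empty line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        index = (address >> self.m) & ((1 << self.k) - 1)
        tag = address >> (self.k + self.m)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag          # the 'old' word is flushed
        self.misses += 1
        return False


# 64 kbyte cache on a 32-bit machine: k = 14, m = 2.
cache = DirectMappedCache(k=14, m=2)
a = 0x1000
b = a + (1 << (14 + 2))                 # 2^(k+m) bytes away: same line
for _ in range(4):
    cache.access(a)
    cache.access(b)
print(cache.hits, cache.misses)
```

Because `a` and `b` share a line, every access evicts the other word and nothing ever hits.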

Cache - Conflicts

• Modified or dirty words
  When a word is modified in cache

Write-back cache
• Only writes data back when needed
• Misses: two memory accesses
  • Write modified word back
  • Read new word

Write-through cache
• Low priority write to main memory is queued
• Processor is delayed by read only
  • Memory write occurs in parallel with other work
  • Instruction and necessary data fetches take priority

Cache - Write-through or write-back?

• Write-through
  • Allows an intelligent bus interface unit to make efficient use of a serious bottle-neck:
    the processor - memory interface (main memory bus)
  • Reads (instruction and data) need priority!
    • They stall the processor
  • Writes can be delayed
    • At least until the location is needed!
  • More on intelligent system interface units later

but ...

Cache - Write-through or write-back?

• Write-through
  • Seems a good idea!
but ...
  • Multiple writes to the same location waste memory bus bandwidth
Typical programs run better with write-back caches
however
  • Often you can easily predict which will be best
Some processors (eg PowerPC) allow you to classify memory regions as write-back or write-through
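A rough way to see the bandwidth argument is to count main-memory writes under each policy. The sketch below is a deliberately simplified model, not a description of any real processor: it assumes every store hits in the cache and that each modified word is eventually evicted exactly once. The function name `memory_writes` is illustrative.

```python
def memory_writes(store_addresses, policy):
    """Count main-memory writes for a stream of stores that all hit in cache.

    Assumes each modified word is evicted exactly once after the stream ends.
    """
    if policy == "write-through":
        return len(store_addresses)        # every store is queued to memory
    if policy == "write-back":
        return len(set(store_addresses))   # one write-back per dirty word
    raise ValueError(f"unknown policy: {policy}")


# A loop counter stored 1000 times, plus 10 stores to a neighbouring word.
stores = [0x100] * 1000 + [0x104] * 10
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```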

Cache - more bits

• Cache lines need some status bits
  • Tag bits + ..
  • Valid
    • All set to false on power up
    • Set to true as words are loaded into cache
  • Dirty
    • Needed by write-back cache
    • Write-through cache always queues the write, so lines are never ‘dirty’
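The status bits can be pictured as extra fields on each cache line. The following is a sketch for a write-back cache only; the names `CacheLine`, `load` and `store` are hypothetical and the data field is a single word.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    valid: bool = False   # false on power up
    dirty: bool = False   # only meaningful in a write-back cache
    tag: int = 0
    data: int = 0


def load(line, tag, data):
    """Fill a line from memory.

    Returns True if the old contents had to be written back first.
    """
    needs_write_back = line.valid and line.dirty
    line.valid, line.dirty, line.tag, line.data = True, False, tag, data
    return needs_write_back


def store(line, data):
    """Update a word already in cache; memory is now out of date."""
    line.data = data
    line.dirty = True


line = CacheLine()
print(load(line, tag=5, data=11))   # empty line: nothing to write back
store(line, 12)
print(load(line, tag=6, data=0))    # dirty line: write back before refill
```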

Cache - Improving Performance

• Conflicts ( addresses 2^(k+m) bytes apart )
  • Degrade cache performance
    • Lower hit rate
  • Murphy’s Law operates
    • Addresses are never random!
    • Some locations ‘thrash’ in cache
      • Continually replaced and restored

Cache - Fully Associative

• All tags are compared at the same time• Words can use any cache line

Cache - Fully Associative

• Associative
  • Each tag is compared at the same time
  • Any match is a hit
  • Avoids ‘unnecessary’ flushing
• Replacement
  • Least Recently Used - LRU
  • Needs extra status bits
    • Cycles since last accessed
• Hardware cost high
  • Extra comparators
  • Wider tags
    • p-m bits vs p-k-m bits
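A software sketch of a fully associative cache with LRU replacement is given below. It uses Python's `collections.OrderedDict` to keep lines in recency order, which stands in for the per-line status bits that hardware would need; the class name `FullyAssociativeCache` and the addresses are assumptions for the example.

```python
from collections import OrderedDict


class FullyAssociativeCache:
    """Tag-only model: any word may occupy any line, LRU replacement."""

    def __init__(self, num_lines, m):
        self.num_lines = num_lines
        self.m = m
        self.lines = OrderedDict()   # tag -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        tag = address >> self.m      # p-m tag bits: all but the byte address
        if tag in self.lines:
            self.lines.move_to_end(tag)      # now the most recently used
            self.hits += 1
            return True
        if len(self.lines) == self.num_lines:
            self.lines.popitem(last=False)   # evict the least recently used
        self.lines[tag] = True
        self.misses += 1
        return False


cache = FullyAssociativeCache(num_lines=2, m=2)
a, b = 0x1000, 0x1000 + (1 << 16)    # these two collide in a direct-mapped cache
for _ in range(4):
    cache.access(a)
    cache.access(b)
print(cache.hits, cache.misses)
```

The two addresses that would thrash in a 64 kbyte direct-mapped cache coexist here, so only the first access to each misses.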

Cache - Set Associative

• 2-way set associative
  • Each line - two words
  • Two comparators only

Cache - Set Associative

• n-way set associative caches
  • n can be small: 2, 4, 8
  • Best performance
  • Reasonable hardware cost
  • Most high performance processors
• Replacement policy
  • LRU choice from n
  • Reasonable LRU approximation
    • 1 or 2 bits
    • Set on access
    • Cleared / decremented by timer
    • Choose cleared word for replacement
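The approximate-LRU scheme can be sketched with one "used" bit per way. This is an illustrative model under stated assumptions: the class `SetAssociativeCache` is hypothetical, `timer_tick` stands in for the hardware timer, and falling back to way 0 when every bit is set is a choice made for this example, not something the slides specify.

```python
class SetAssociativeCache:
    """Tag-only n-way set associative cache with a 1-bit LRU approximation."""

    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.used = [[False] * ways for _ in range(num_sets)]

    def access(self, address):
        line_number = address // self.line_bytes
        index = line_number % self.num_sets
        tag = line_number // self.num_sets
        tags, used = self.tags[index], self.used[index]
        if tag in tags:                    # compare against all n ways
            used[tags.index(tag)] = True   # bit set on access
            return True
        if None in tags:                   # an empty way is free to use
            victim = tags.index(None)
        elif False in used:                # choose a way whose bit is clear
            victim = used.index(False)
        else:                              # assumption: fall back to way 0
            victim = 0
        tags[victim] = tag
        used[victim] = True
        return False

    def timer_tick(self):
        """Clear every used bit, as the periodic timer would."""
        for used in self.used:
            used[:] = [False] * self.ways


cache = SetAssociativeCache(num_sets=4, ways=2, line_bytes=4)
print([cache.access(addr) for addr in (0, 16, 0, 16)])   # both map to set 0
```

After a `timer_tick()`, whichever way is touched next keeps its bit set, and the untouched way becomes the replacement candidate.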

Cache - Locality of Reference

Temporal Locality
• Same location will be referenced again soon
  • Access same data again
  • Program loops - access same instruction again
• Caches described so far exploit temporal locality

Spatial Locality
• Nearby locations will be referenced soon
  • Next element of an array
  • Next instruction of a program

Cache - Line Length

• Spatial Locality
  • Use very long cache lines
  • Fetch one datum
    • Neighbours fetched also
• PowerPC 601 (Motorola/Apple/IBM)
  first of the single chip Power processors
  • 64 sets
  • 8-way set associative
  • 32 bytes per line
  • 32 bytes (8 instructions) fetched into instruction buffer in one cycle
  • 64 x 8 x 32 = 16k byte total
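The benefit of long lines for sequential access can be estimated with a short sketch. It assumes a single pass over an array that starts on a line boundary, an initially empty cache, and no evictions during the pass; the function name `sequential_scan_hit_rate` is illustrative.

```python
def sequential_scan_hit_rate(num_words, word_bytes, line_bytes):
    """Hit rate for one pass over an array, starting with an empty cache."""
    cached_lines = set()
    hits = 0
    for i in range(num_words):
        line = (i * word_bytes) // line_bytes
        if line in cached_lines:
            hits += 1                 # a neighbour already brought this line in
        else:
            cached_lines.add(line)    # miss: the whole line is fetched
    return hits / num_words


# 4-byte words with the 32-byte line quoted on the slide, then one-word lines.
print(sequential_scan_hit_rate(1024, 4, 32))
print(sequential_scan_hit_rate(1024, 4, 4))
```

With 32-byte lines only the first word of each line misses; with one-word lines a single pass never hits.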

Cache - Separate I- and D-caches

• Unified cache
  • Instructions and Data in same cache
• Two caches
  • Instructions
  • Data
  Increases total bandwidth
• MIPS R10000
  • 32 Kbyte Instruction; 32 Kbyte Data
  • Instruction cache is pre-decoded! (32 -> 36 bits)
  • Data
    • 8-word (64 byte) line, 2-way set associative
    • 256 sets
  • Replacement policy?
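The geometry figures quoted on these slides can be cross-checked with a little arithmetic: total size is sets x ways x bytes per line. The helper names `cache_bytes` and `field_widths` are hypothetical, the sketch assumes power-of-two sizes, and the numbers are taken from the slides as given.

```python
def cache_bytes(sets, ways, line_bytes):
    """Total data capacity of a set associative cache."""
    return sets * ways * line_bytes


def field_widths(sets, line_bytes):
    """(set index bits, byte-in-line bits), assuming powers of two."""
    index_bits = sets.bit_length() - 1
    offset_bits = line_bytes.bit_length() - 1
    return index_bits, offset_bits


# MIPS R10000 data cache as quoted: 256 sets, 2-way, 64-byte lines.
print(cache_bytes(256, 2, 64) // 1024, "Kbyte", field_widths(256, 64))
# PowerPC 601 as quoted: 64 sets, 8-way, 32-byte lines.
print(cache_bytes(64, 8, 32) // 1024, "Kbyte", field_widths(64, 32))
```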