
Page 1: Memory Hierarchy

Slide contents from:

Hennessy & Patterson, 5ed. Appendix B and Chapter 2.

David Wentzlaff, ELE 475 – Computer Architecture.

MJT, High Performance Computing, NPTEL.

Page 2: Introduction

➢ Programmers want unlimited amounts of memory with low latency

➢ Fast memory technology is more expensive per bit than slower memory

➢ Solution: organize the memory system into a hierarchy

➢ The entire addressable memory space is available in the largest, slowest memory

➢ Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor

➢ Temporal and spatial locality ensure that nearly all references can be found in the smaller memories

➢ Gives the illusion of a large, fast memory being presented to the processor

Page 3: Register File

Page 4: Memory Array - SRAM

Page 5: Memory Arrays - DRAM

Page 6: SRAM vs. DRAM

Foss, R.C. “Implementing Application-Specific Memory”, ISSCC 1996

Page 7: Memory Technology Trade-offs

Latches/Registers → Register File → SRAM → DRAM

From top to bottom of this spectrum: Low Capacity, Low Latency, High Bandwidth (more and wider ports) → High Capacity, High Latency, Low Bandwidth

Page 8: Memory Hierarchy

Page 9: Memory Performance Gap

Page 10: Memory Performance Gap

● Aggregate peak bandwidth required by the Intel i7
  – 2 data references per clock per core
  – 4 cores, 3.2 GHz
  – 25.6 billion 64-bit data references/sec + 12.8 billion 128-bit instruction references/sec
  – = 409.6 GB/s
● DRAM bandwidth = 25.6 GB/s, a small fraction of what is required. Bridging the gap requires:
  – Multiport, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip

Page 11: Predictable Memory Reference Patterns

Spatial and Temporal Locality

Hatfield and Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Page 12: Inside a Cache

[Figure: Processor ↔ Cache ↔ Main Memory. The cache consists of a controller, a tag array, and a data array organized as lines (blocks) 0 … N.]

Page 13: Classification of Caches

● Block Placement
  – Where can a block be placed in a cache?
  – Direct mapped, Set associative, Fully associative
● Block Identification
  – How is a block found if it is in the cache?
● Block Replacement
  – Which block should be replaced on a miss?
● Write Strategy
  – What happens on a write?

Page 14: Block Identification

2^Index = CacheSize / (BlockSize × SetAssociativity)

[Figure: address breakdown (tag/index/offset) for a Direct Mapped and a 2-way Set Associative cache.]

A worked example of the formula follows.
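As a worked example (cache parameters assumed for illustration): a 32 KB cache with 32 B blocks gives, if direct mapped, 2^Index = 32768 / (32 × 1) = 1024, i.e., 10 index bits; if 2-way set associative, 2^Index = 32768 / (32 × 2) = 512, i.e., 9 index bits.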

Page 15: Block Replacement

● No choice in a direct mapped cache
● In an associative cache, which block from the set should be evicted when the set becomes full?
● Random
● Least Recently Used (LRU)
  – LRU cache state must be updated on every access
  – True implementation only feasible for small sets (2-way); see the sketch after this list
  – Almost LRU
● First In, First Out (FIFO) aka Round-Robin
  – Used in highly associative caches
● Not Most Recently Used (NMRU)
  – FIFO with exception for most recently used block(s)
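A minimal C sketch (type and function names hypothetical) of why true LRU is cheap for 2-way sets: one bit per set, recording the most-recently-used way, suffices.

#include <stdio.h>
#include <stdbool.h>

/* For a 2-way set, true LRU needs only one bit per set recording the
 * most-recently-used way; the other way is the victim. The state is
 * updated on every access, as the slide notes. */
typedef struct {
    bool mru_is_way1;   /* true -> way 1 used last, so way 0 is LRU */
} SetLruState;

static void lru_touch(SetLruState *s, int way) {  /* call on every access */
    s->mru_is_way1 = (way == 1);
}

static int lru_victim(const SetLruState *s) {     /* pick eviction victim */
    return s->mru_is_way1 ? 0 : 1;
}

int main(void) {
    SetLruState s = { false };
    lru_touch(&s, 0);
    lru_touch(&s, 1);
    printf("victim: way %d\n", lru_victim(&s));   /* way 0: least recent */
    return 0;
}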

Page 16: Write Strategy: How are writes handled?

● Cache Hit
  – Write Through – write both cache and memory; generally higher traffic but simpler to design
  – Write Back – write cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write backs; more complicated
● Cache Miss
  – No Write Allocate – only write to main memory
  – Write Allocate – fetch block into cache, then write
● Common Combinations
  – Write Through & No Write Allocate
  – Write Back & Write Allocate

A sketch contrasting the two hit policies follows.
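A minimal C sketch of the two write-hit policies and the dirty bit (all names hypothetical; a one-line toy cache, not a real design):

#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int  tag;
    int  data;
    bool valid;
    bool dirty;   /* meaningful only for write-back */
} CacheLine;

static int memory[16];   /* toy backing store */

/* Write through: update both the cache line and memory on a write hit. */
static void write_through(CacheLine *line, int addr, int value) {
    line->data = value;
    memory[addr] = value;       /* memory is always up to date */
}

/* Write back: update only the cache line; memory is written at eviction. */
static void write_back(CacheLine *line, int value) {
    line->data = value;
    line->dirty = true;         /* the line now differs from memory */
}

/* The dirty bit avoids unnecessary write backs, as the slide notes. */
static void evict(CacheLine *line, int addr) {
    if (line->dirty) {
        memory[addr] = line->data;
        line->dirty = false;
    }
    line->valid = false;
}

int main(void) {
    CacheLine line = { .tag = 0, .data = 0, .valid = true, .dirty = false };
    write_through(&line, 3, 41);           /* memory[3] updated immediately */
    write_back(&line, 42);                 /* memory[3] is now stale        */
    printf("before eviction: memory[3] = %d\n", memory[3]);
    evict(&line, 3);
    printf("after  eviction: memory[3] = %d\n", memory[3]);
    return 0;
}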

Page 17: Average Memory Access Time

[Figure: Processor ↔ Cache ↔ Main Memory, with accesses resolving as a HIT in the cache or a MISS to main memory.]

Avg. Memory Access Time = Hit Time + (Miss Rate × Miss Penalty)
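As a worked example (numbers assumed for illustration): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, Avg. Memory Access Time = 1 + (0.05 × 100) = 6 cycles.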

Page 18: Categorizing Misses: The Three C's

● Cold Start Misses (Compulsory)
  – First reference to a block
● Capacity
  – Cache is too small to hold all data needed by the program; these occur even under a perfect replacement policy
● Conflict (Collision) Misses
  – Misses that occur because of collisions due to less than full associativity

Page 19: Six Basic Cache Optimizations

● Larger block size to reduce miss rate
● Larger caches to reduce miss rate
● Higher associativity to reduce miss rate
● Multilevel caches to reduce miss penalty
● Giving priority to read misses over writes to reduce miss penalty
● Avoiding address translation during indexing of the cache to reduce hit time

Page 20: Prioritizing read misses over writes

● A direct mapped, write-through cache maps addresses 512 and 1024 to the same cache block, and the write buffer is not checked on a read miss. Will the value in R2 always be equal to the value in R3? (See the sequence sketched below.) Not necessarily: the write of R3 to address 512 may still be sitting in the write buffer when the read of address 512 misses, so the read can return a stale value from memory.
● Solution for write-through: check the write buffer contents on a read miss; if there is no conflict, let the read miss continue. For a write-back cache: copy the dirty block to a buffer, read memory first, then write the buffered block back, so the read miss is serviced before the write.
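The store/load sequence this slide refers to (per H&P, Appendix B) can be rendered in C, with registers modeled as plain variables and the cache behavior described in comments:

#include <stdio.h>

/* Addresses 512 and 1024 map to the same direct-mapped cache block. */
static int mem[2048];

int main(void) {
    int R3 = 7, R1, R2;
    mem[512] = R3;     /* SW R3, 512(R0): the write may sit in the write buffer */
    R1 = mem[1024];    /* LW R1, 1024(R0): read miss evicts the block for 512   */
    R2 = mem[512];     /* LW R2, 512(R0): if this read miss does not check the  */
                       /* write buffer, R2 can receive the stale memory value,  */
                       /* so R2 would not equal R3                              */
    printf("R1=%d R2=%d R3=%d\n", R1, R2, R3);
    return 0;
}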

Page 21: Avoiding Address Translation During Indexing of the Cache to Reduce Hit Time

● Address translation is on the critical path
  – Determines the clock cycle time of the processor!
● Use the virtual address or the physical address for the index and the tag?
  – E.g., virtually indexed, physically tagged caches

[Figure: two organizations. Physically addressed cache: CPU → MMU (virtual address → physical address) → Cache. Virtually addressed cache: CPU → Cache using the virtual address; the MMU translates to a physical address only on a cache miss, before accessing main memory (MM).]

Page 22: Virtual Memory

Page 23: Virtual Memory

Page 24: Page Translation Table per Process

[Figure: each entry holds a Valid Bit and maps a Virtual Page Number to a Physical Page Number (if the page is in memory) or a Disk Address (if it is not).]

A minimal C rendering of one such entry follows.
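A sketch of one page-table entry as drawn on the slide (field names and types are assumptions for illustration, not a real processor's format):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;  /* page is resident in main memory?             */
    uint64_t ppn;    /* physical page number, meaningful when valid  */
    uint64_t disk;   /* disk address of the page, used when !valid   */
} PageTableEntry;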

Page 25: Opteron Example

Page 26: Virtual Memory – Page Faults

● Situation where a virtual address generated by the processor is not available in main memory
● Detected on an attempt to translate the address
  – Page table entry is invalid
● Must be `handled' by the operating system
  – Identify a slot in main memory to be used
  – Get the page contents from disk
  – Update the page table entry
● Data can then be provided to the processor

Page 27: Fast Translation

● Address translation is on the critical path
● Paging – 2 memory accesses!
  – Address translation table + data
● Translation Lookaside Buffer (TLB) – a cache of recent translations; a lookup sketch follows
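A direct-mapped TLB lookup sketched in C (sizes and names are assumptions for illustration); a hit avoids the extra page-table memory access:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   12   /* assume 4 KB pages */

typedef struct { uint64_t vpn; uint64_t ppn; bool valid; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and yields the physical page number; on a
 * miss the page table must be walked (the extra access a TLB avoids). */
static bool tlb_lookup(uint64_t vaddr, uint64_t *ppn) {
    uint64_t vpn = vaddr >> PAGE_BITS;
    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];  /* direct-mapped for simplicity */
    if (e->valid && e->vpn == vpn) {
        *ppn = e->ppn;
        return true;
    }
    return false;
}

int main(void) {
    uint64_t ppn = 0;
    tlb[3] = (TlbEntry){ .vpn = 3, .ppn = 0x42, .valid = true };
    if (tlb_lookup(((uint64_t)3 << PAGE_BITS) | 0x10, &ppn))
        printf("TLB hit: ppn = 0x%llx\n", (unsigned long long)ppn);
    else
        printf("TLB miss: walk the page table\n");
    return 0;
}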

Page 28: VM – Example

Page 29: Cache Access Example

[Figure: direct mapped, 32 KB cache with 32 B blocks and a 32 b main memory address. The low 15 b of the address select within the cache (10 b line index + 5 b block offset); the remaining bits form the tag. Each line holds words W1 … W8; the tag comparison (=?) yields Cache Hit (requested word sent to the processor) or Cache Miss. The tag is not needed until the cache line has been read.]

Page 30: VM Example

[Figure: a 64 b virtual memory address split into Virtual Page Number and Page Offset. The TLB Tag and TLB Index come from the virtual page number; the TLB comparison (=?) yields TLB Hit or Page Fault and produces the Physical Address. The L1 cache index and offset come from the virtual address (Virtually Indexed) while the tag comes from the physical address (Physically Tagged), and the comparison yields L1 Hit/Miss; on a miss the physical address (Tag | Index | Offset) goes To L2 to access the L2 cache block.]

Page 31: Summary of Basic Cache Optimizations

Page 32: Advanced Cache Optimizations

● Reducing hit time
  – Small L1, way prediction
● Increasing cache bandwidth
  – Pipelined caches, multibanked caches, non-blocking caches
● Reducing the miss penalty
  – Critical word first, merging write buffers
● Reducing the miss rate
  – Compiler optimizations
● Reducing the miss penalty or miss rate via parallelism
  – Hardware and compiler prefetching

Page 33: Small and Simple L1 Caches

● Low latency and low power
  – Critical timing path: indexing, tag comparison, select correct set
● Direct-mapped caches can overlap tag compare and transmission of data
● Lower associativity reduces power because fewer cache lines are accessed
● CAD tools to estimate hit time and power
  – CACTI
  – Input parameters: feature size, cache size, associativity, number of read/write ports

Page 34: L1 Size and Associativity

Energy per read vs. size and associativity

Page 35: Way Prediction

● To improve hit time, predict the way; pre-set the mux
  – Extra bits per cache block
  – Mis-prediction gives a longer hit time
  – Prediction accuracy
    ● 90% for two-way, 80% for four-way
    ● I-cache has better accuracy than D-cache
  – Today: ARM Cortex-A8
● Extend to predict the block as well (way selection)
  – Saves access power
  – Increases mis-prediction penalty
  – Makes caches difficult to pipeline

Page 36: Pipelining Cache

● Pipeline cache access to improve bandwidth
  – Examples: Pentium: 1 cycle; Pentium Pro – Pentium III: 2 cycles; Pentium 4 – Core i7: 4 cycles
● Increases branch misprediction penalty
● Makes it easier to increase associativity

Page 37: Nonblocking Caches

● Allow hits before previous misses complete
  – Hit under miss
  – Hit under multiple misses
● L2 must support this
● In general, processors can hide an L1 miss penalty but not an L2 miss penalty

Page 38: Multibanked Caches

● The cache is divided into independent banks
● Banks can be accessed simultaneously
● Spread the addresses of blocks sequentially across the banks (Sequential Interleaving); see the example below
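For example, with four banks under sequential interleaving, the bank number is the block address modulo 4: blocks 0, 4, 8, … map to bank 0; blocks 1, 5, 9, … to bank 1; and so on.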

Page 39: Critical Word First and Early Restart

● Processor needs only one word at a time
● Critical word first:
  – Request the missed word from memory first
  – Send it to the processor as soon as it arrives
● Early restart:
  – Request words in normal order
  – Send the missed word to the processor as soon as it arrives
● Effectiveness of this strategy depends on block size and the likelihood of another access to the portion of the block that has not yet been fetched

Page 40: Merging Write Buffer

● When storing to a block that is already pending in the write buffer, update the existing write buffer entry
● Reduces stalls due to a full write buffer
● Does not apply to I/O addresses

[Figure: write buffer contents without and with write merging.]
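For example (the situation in H&P's figure): four stores to the four sequential 64-bit words of one block can merge into a single write-buffer entry, one address plus four valid data words, instead of occupying four separate entries.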

Page 41: Compiler Optimizations

● Loop interchange
  – Swap nested loops to access memory in sequential order
● Blocking
  – Instead of accessing entire rows or columns, subdivide matrices into blocks
  – Requires more memory accesses but improves locality of the accesses

Page 42: Storage Order of Arrays

● Storage order of 2D matrices in memory
  – Compiler's decision
● Row major order
  – Elements stored row-wise
  – C, C++
● Column major order
  – Elements stored column-wise
  – FORTRAN

Page 43: 2D Matrix Sum

double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
  for (i = 0; i < 1024; i++)
    B[i][j] = A[i][j] + B[i][j];

Load A[0][0] Load B[0][0] Store B[0][0]

Load A[1][0] Load B[1][0] Store B[1][0]

.... .... ....

Load A[0][1] Load B[0][1] Store B[0][1]

Load A[0][2] Load B[0][2] Store B[0][2]

... .... ....

● Load A[0][0] is a cold start miss

● A cache block contains A[0][0] → A[0][3]

● The reference order is different from the storage order!

● Every iteration: cold miss (load A), cold miss (load B), hit (store B hits the block just loaded)

● Hit ratio: 1 hit in every 3 accesses ≈ 33%

Page 44: Loop Interchange

double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
  for (i = 0; i < 1024; i++)
    B[i][j] = A[i][j] + B[i][j];

double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    B[i][j] = A[i][j] + B[i][j];

Page 45: Loop Interchange

● A cache block contains A[0][0] → A[0][3]
● Hit ratio: 83.3% (per block of 4 elements: 12 accesses, 2 misses, 10 hits → 10/12)

double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    B[i][j] = A[i][j] + B[i][j];

Load A[0][0] Load B[0][0] Store B[0][0]
Load A[0][1] Load B[0][1] Store B[0][1]
.... .... ....
Load A[1][0] Load B[1][0] Store B[1][0]
Load A[1][1] Load B[1][1] Store B[1][1]
... .... ....

Page 46: Matrix Multiplication

double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];

Y[0,0], Z[0,0], Y[0,1], Z[1,0], Y[0,2], Z[2,0], .... X[0,0]
Y[0,0], Z[0,1], Y[0,1], Z[1,1], Y[0,2], Z[2,1], .... X[0,1]
....
Y[1,0], Z[0,0], Y[1,1], Z[1,0], Y[1,2], Z[2,0], .... X[1,0]

Page 47: Blocking

● Make full use of the elements of Z while they are in the cache

[Figure: X = X + Y × Z with 2×2 submatrices (0,0), (0,1), (1,0), (1,1) of Y and Z:]

X[0,0] = Y[0,0] × Z[0,0] + Y[0,1] × Z[1,0]
X[1,0] = Y[1,0] × Z[0,0] + Y[1,1] × Z[1,0]

Page 48: Blocking

Original:

double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];

Blocked, with blocking factor B:

double X[N][N], Y[N][N], Z[N][N];
for (J = 0; J < N; J += B)
  for (K = 0; K < N; K += B)
    for (i = 0; i < N; i++)
      for (j = J; j < min(J+B, N); j++) {
        r = 0;
        for (k = K; k < min(K+B, N); k++)
          r += Y[i][k] * Z[k][j];
        X[i][j] += r;
      }
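The blocking factor B is chosen so that the data being reused, roughly a B × B submatrix of Z plus a row slice of Y, fits in the cache; the inner loops then hit in the cache on Z instead of missing on every column access.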

Page 49: Hardware Prefetching

● Fetch two blocks on a miss (the requested block and the next sequential block)

[Figure: performance gains from Pentium 4 pre-fetching.]

Page 50: Compiler Prefetching

● Insert prefetch instructions before the data is needed
● Non-faulting: prefetch doesn't cause exceptions
● Register prefetch
  – Loads data into a register
● Cache prefetch
  – Loads data into the cache
● Combine with loop unrolling and software pipelining

Page 51: Summary

Page 52: Memory Technology

● Main memory serves as input and output to I/O interfaces and the processor
● DRAMs for main memory, SRAM for caches
● Metrics: latency, bandwidth
● Access Time
  – Time between a read request and when the desired word arrives
● Cycle Time
  – Minimum time between unrelated requests to memory

Page 53: Memory Technology

● SRAM
  – Requires low power to retain bits; 6 transistors/bit
● DRAM
  – Must be re-written after being read
  – Must be periodically refreshed (every ~8 ms)
  – Address lines are multiplexed (a sketch follows):
    ● Upper half of address: row access strobe (RAS)
    ● Lower half of address: column access strobe (CAS)
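A sketch in C of the multiplexed addressing (the bit widths are assumptions for illustration: a hypothetical device with a square 4096 × 4096 array, so 12 row bits and 12 column bits):

#include <stdio.h>
#include <stdint.h>

#define ROW_BITS 12
#define COL_BITS 12

int main(void) {
    /* The same address pins carry the row half on RAS, then the column
     * half on CAS, halving the pin count at the cost of two transfers. */
    uint32_t addr = 0x00ABCDEF & ((1u << (ROW_BITS + COL_BITS)) - 1);
    uint32_t row = addr >> COL_BITS;              /* upper half: sent with RAS */
    uint32_t col = addr & ((1u << COL_BITS) - 1); /* lower half: sent with CAS */
    printf("addr=0x%06X -> row=0x%03X (RAS), col=0x%03X (CAS)\n", addr, row, col);
    return 0;
}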

Page 54: Memory Technology - Optimizations

● Multiple accesses to the same row
● Synchronous DRAM
  – Clocked operation, burst mode
● Wider interfaces
● Double data rate
● Multiple banks on each DRAM device

Page 55: Clock Rates, Bandwidth and Names

[Table: DDR generations and supply voltages]

DDR:  2.5 V
DDR2: 1.8 V
DDR3: 1.5 V
DDR4: 1 - 1.2 V

● GDDR5 – graphics memory based on DDR3
  – 2x – 5x bandwidth per DRAM vs. DDR3
  – Wider interface, higher clock rate
  – Attached via soldering instead of socketed DIMM

Page 56: Flash Memory

● Type of EEPROM
● Must be erased (in blocks) before being overwritten
● Non-volatile
● Limited number of write cycles
● Cheaper than SDRAM, more expensive than disk
● Slower than SDRAM, faster than disk

Page 57: Memory Dependability

● Memory is susceptible to cosmic rays
  – Soft errors (transient)
● Hard errors (permanent)
  – ECC can detect and correct soft errors
  – Spare rows are used as replacements in case a row fails
● Chipkill: a RAID-like error recovery technique