
Page 1: Memory Hierarchy

Slide contents from:

Hennessy & Patterson, 5ed. Appendix B and Chapter 2.

David Wentzlaff, ELE 475 – Computer Architecture.

MJT, High Performance Computing, NPTEL.

Page 2: Introduction

➢ Programmers want unlimited amounts of memory with low latency

➢ Fast memory technology is more expensive per bit than slower memory

➢ Solution: organize the memory system into a hierarchy

➢ The entire addressable memory space is available in the largest, slowest memory

➢ Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor

➢ Temporal and spatial locality ensure that nearly all references can be found in the smaller memories

➢ Gives the illusion of a large, fast memory being presented to the processor

Page 3: Register File

Page 4: Memory Array - SRAM

Page 5: Memory Arrays - DRAM

Page 6: SRAM vs. DRAM

Foss, R.C. “Implementing Application-Specific Memory”, ISSCC 1996

Page 7: Memory Technology Trade-offs

Latches/Registers → Register File → SRAM → DRAM

From top to bottom of this spectrum: Low Capacity, Low Latency, High Bandwidth (more and wider ports) → High Capacity, High Latency, Low Bandwidth

Page 8: Memory Hierarchy

Page 9: Memory Performance Gap

Page 10: Memory Performance Gap

● Aggregate peak bandwidth required by the Intel i7
  – 2 data references per clock per core
  – 4 cores, 3.2 GHz
  – 25.6 billion 64-bit data references/sec + 12.8 billion 128-bit instruction references/sec
  – = 409.6 GB/s
● DRAM bandwidth = 25.6 GB/s, a small fraction of what is required. Bridging the gap requires:
  – Multiport, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip

Page 11: Predictable Memory Reference Patterns

Spatial and Temporal Locality

Hatfield and Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Page 12: Inside a Cache

[Figure: Processor ↔ Cache ↔ Main Memory. The cache consists of a controller, a tag array, and a data array organized as lines (blocks) 0 … N.]

Page 13: Classification of Caches

● Block Placement
  – Where can a block be placed in a cache?
  – Direct mapped, Set associative, Fully associative
● Block Identification
  – How is a block found if it is in the cache?
● Block Replacement
  – Which block should be replaced on a miss?
● Write Strategy
  – What happens on a write?

Page 14: Block Identification

2^Index = CacheSize / (BlockSize × SetAssociativity)

[Figure: address breakdown (tag/index/offset) for a Direct Mapped and a 2-way Set Associative cache.]

A worked example of the formula follows.
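As a worked example (cache parameters assumed for illustration): a 32 KB cache with 32 B blocks gives, if direct mapped, 2^Index = 32768 / (32 × 1) = 1024, i.e., 10 index bits; if 2-way set associative, 2^Index = 32768 / (32 × 2) = 512, i.e., 9 index bits.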

Page 15: Block Replacement

● No choice in a direct mapped cache
● In an associative cache, which block from the set should be evicted when the set becomes full?
● Random
● Least Recently Used (LRU)
  – LRU cache state must be updated on every access
  – True implementation only feasible for small sets (2-way); see the sketch after this list
  – Almost LRU
● First In, First Out (FIFO) aka Round-Robin
  – Used in highly associative caches
● Not Most Recently Used (NMRU)
  – FIFO with exception for most recently used block(s)
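A minimal C sketch (type and function names hypothetical) of why true LRU is cheap for 2-way sets: one bit per set, recording the most-recently-used way, suffices.

#include <stdio.h>
#include <stdbool.h>

/* For a 2-way set, true LRU needs only one bit per set recording the
 * most-recently-used way; the other way is the victim. The state is
 * updated on every access, as the slide notes. */
typedef struct {
    bool mru_is_way1;   /* true -> way 1 used last, so way 0 is LRU */
} SetLruState;

static void lru_touch(SetLruState *s, int way) {  /* call on every access */
    s->mru_is_way1 = (way == 1);
}

static int lru_victim(const SetLruState *s) {     /* pick eviction victim */
    return s->mru_is_way1 ? 0 : 1;
}

int main(void) {
    SetLruState s = { false };
    lru_touch(&s, 0);
    lru_touch(&s, 1);
    printf("victim: way %d\n", lru_victim(&s));   /* way 0: least recent */
    return 0;
}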

Page 16: Write Strategy: How are writes handled?

● Cache Hit
  – Write Through – write both cache and memory; generally higher traffic but simpler to design
  – Write Back – write cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write backs; more complicated
● Cache Miss
  – No Write Allocate – only write to main memory
  – Write Allocate – fetch block into cache, then write
● Common Combinations
  – Write Through & No Write Allocate
  – Write Back & Write Allocate

A sketch contrasting the two hit policies follows.
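A minimal C sketch of the two write-hit policies and the dirty bit (all names hypothetical; a one-line toy cache, not a real design):

#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int  tag;
    int  data;
    bool valid;
    bool dirty;   /* meaningful only for write-back */
} CacheLine;

static int memory[16];   /* toy backing store */

/* Write through: update both the cache line and memory on a write hit. */
static void write_through(CacheLine *line, int addr, int value) {
    line->data = value;
    memory[addr] = value;       /* memory is always up to date */
}

/* Write back: update only the cache line; memory is written at eviction. */
static void write_back(CacheLine *line, int value) {
    line->data = value;
    line->dirty = true;         /* the line now differs from memory */
}

/* The dirty bit avoids unnecessary write backs, as the slide notes. */
static void evict(CacheLine *line, int addr) {
    if (line->dirty) {
        memory[addr] = line->data;
        line->dirty = false;
    }
    line->valid = false;
}

int main(void) {
    CacheLine line = { .tag = 0, .data = 0, .valid = true, .dirty = false };
    write_through(&line, 3, 41);           /* memory[3] updated immediately */
    write_back(&line, 42);                 /* memory[3] is now stale        */
    printf("before eviction: memory[3] = %d\n", memory[3]);
    evict(&line, 3);
    printf("after  eviction: memory[3] = %d\n", memory[3]);
    return 0;
}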

Page 17: Average Memory Access Time

[Figure: Processor ↔ Cache ↔ Main Memory, with accesses resolving as a HIT in the cache or a MISS to main memory.]

Avg. Memory Access Time = Hit Time + (Miss Rate × Miss Penalty)
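As a worked example (numbers assumed for illustration): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, Avg. Memory Access Time = 1 + (0.05 × 100) = 6 cycles.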

Page 18: Categorizing Misses: The Three C's

● Cold Start Misses (Compulsory)
  – First reference to a block
● Capacity
  – Cache is too small to hold all data needed by the program; these occur even under a perfect replacement policy
● Conflict (Collision) Misses
  – Misses that occur because of collisions due to less than full associativity

Page 19: Six Basic Cache Optimizations

● Larger block size to reduce miss rate
● Larger caches to reduce miss rate
● Higher associativity to reduce miss rate
● Multilevel caches to reduce miss penalty
● Giving priority to read misses over writes to reduce miss penalty
● Avoiding address translation during indexing of the cache to reduce hit time

Page 20: Prioritizing read misses over writes

● A direct mapped, write-through cache maps addresses 512 and 1024 to the same cache block, and the write buffer is not checked on a read miss. Will the value in R2 always be equal to the value in R3? (See the sequence sketched below.) Not necessarily: the write of R3 to address 512 may still be sitting in the write buffer when the read of address 512 misses, so the read can return a stale value from memory.
● Solution for write-through: check the write buffer contents on a read miss; if there is no conflict, let the read miss continue. For a write-back cache: copy the dirty block to a buffer, read memory first, then write the buffered block back, so the read miss is serviced before the write.
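The store/load sequence this slide refers to (per H&P, Appendix B) can be rendered in C, with registers modeled as plain variables and the cache behavior described in comments:

#include <stdio.h>

/* Addresses 512 and 1024 map to the same direct-mapped cache block. */
static int mem[2048];

int main(void) {
    int R3 = 7, R1, R2;
    mem[512] = R3;     /* SW R3, 512(R0): the write may sit in the write buffer */
    R1 = mem[1024];    /* LW R1, 1024(R0): read miss evicts the block for 512   */
    R2 = mem[512];     /* LW R2, 512(R0): if this read miss does not check the  */
                       /* write buffer, R2 can receive the stale memory value,  */
                       /* so R2 would not equal R3                              */
    printf("R1=%d R2=%d R3=%d\n", R1, R2, R3);
    return 0;
}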

Page 21: Avoiding Address Translation During Indexing of the Cache to Reduce Hit Time

● Address translation is on the critical path
  – Determines the clock cycle time of the processor!
● Use the virtual address or the physical address for the index and the tag?
  – E.g., virtually indexed, physically tagged caches

[Figure: two organizations. Physically addressed cache: CPU → MMU (virtual address → physical address) → Cache. Virtually addressed cache: CPU → Cache using the virtual address; the MMU translates to a physical address only on a cache miss, before accessing main memory (MM).]

Page 22: Virtual Memory

Page 23: Virtual Memory

Page 24: Page Translation Table per Process

[Figure: each entry holds a Valid Bit and maps a Virtual Page Number to a Physical Page Number (if the page is in memory) or a Disk Address (if it is not).]

A minimal C rendering of one such entry follows.
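A sketch of one page-table entry as drawn on the slide (field names and types are assumptions for illustration, not a real processor's format):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;  /* page is resident in main memory?             */
    uint64_t ppn;    /* physical page number, meaningful when valid  */
    uint64_t disk;   /* disk address of the page, used when !valid   */
} PageTableEntry;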

Page 25: Opteron Example

Page 26: Virtual Memory – Page Faults

● Situation where a virtual address generated by the processor is not available in main memory
● Detected on an attempt to translate the address
  – Page table entry is invalid
● Must be `handled' by the operating system
  – Identify a slot in main memory to be used
  – Get the page contents from disk
  – Update the page table entry
● Data can then be provided to the processor

Page 27: Fast Translation

● Address translation is on the critical path
● Paging – 2 memory accesses!
  – Address translation table + data
● Translation Lookaside Buffer (TLB) – a cache of recent translations; a lookup sketch follows
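A direct-mapped TLB lookup sketched in C (sizes and names are assumptions for illustration); a hit avoids the extra page-table memory access:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   12   /* assume 4 KB pages */

typedef struct { uint64_t vpn; uint64_t ppn; bool valid; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and yields the physical page number; on a
 * miss the page table must be walked (the extra access a TLB avoids). */
static bool tlb_lookup(uint64_t vaddr, uint64_t *ppn) {
    uint64_t vpn = vaddr >> PAGE_BITS;
    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];  /* direct-mapped for simplicity */
    if (e->valid && e->vpn == vpn) {
        *ppn = e->ppn;
        return true;
    }
    return false;
}

int main(void) {
    uint64_t ppn = 0;
    tlb[3] = (TlbEntry){ .vpn = 3, .ppn = 0x42, .valid = true };
    if (tlb_lookup(((uint64_t)3 << PAGE_BITS) | 0x10, &ppn))
        printf("TLB hit: ppn = 0x%llx\n", (unsigned long long)ppn);
    else
        printf("TLB miss: walk the page table\n");
    return 0;
}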

Page 28: VM – Example

Page 29: Cache Access Example

[Figure: direct mapped, 32 KB cache with 32 B blocks and a 32 b main memory address. The low 15 b of the address select within the cache (10 b line index + 5 b block offset); the remaining bits form the tag. Each line holds words W1 … W8; the tag comparison (=?) yields Cache Hit (requested word sent to the processor) or Cache Miss. The tag is not needed until the cache line has been read.]

Page 30: VM Example

[Figure: a 64 b virtual memory address split into Virtual Page Number and Page Offset. The TLB Tag and TLB Index come from the virtual page number; the TLB comparison (=?) yields TLB Hit or Page Fault and produces the Physical Address. The L1 cache index and offset come from the virtual address (Virtually Indexed) while the tag comes from the physical address (Physically Tagged), and the comparison yields L1 Hit/Miss; on a miss the physical address (Tag | Index | Offset) goes To L2 to access the L2 cache block.]

Page 31: Summary of Basic Cache Optimizations

Page 32: Advanced Cache Optimizations

● Reducing hit time
  – Small L1, way prediction
● Increasing cache bandwidth
  – Pipelined caches, multibanked caches, non-blocking caches
● Reducing the miss penalty
  – Critical word first, merging write buffers
● Reducing the miss rate
  – Compiler optimizations
● Reducing the miss penalty or miss rate via parallelism
  – Hardware and compiler prefetching

Page 33: Small and Simple L1 Caches

● Low latency and low power
  – Critical timing path: indexing, tag comparison, select correct set
● Direct-mapped caches can overlap tag compare and transmission of data
● Lower associativity reduces power because fewer cache lines are accessed
● CAD tools to estimate hit time and power
  – CACTI
  – Input parameters: feature size, cache size, associativity, number of read/write ports

Page 34: L1 Size and Associativity

Energy per read vs. size and associativity

Page 35: Way Prediction

● To improve hit time, predict the way; pre-set the mux
  – Extra bits per cache block
  – Mis-prediction gives a longer hit time
  – Prediction accuracy
    ● 90% for two-way, 80% for four-way
    ● I-cache has better accuracy than D-cache
  – Today: ARM Cortex-A8
● Extend to predict the block as well (way selection)
  – Saves access power
  – Increases mis-prediction penalty
  – Makes caches difficult to pipeline

Page 36: Pipelining Cache

● Pipeline cache access to improve bandwidth
  – Examples: Pentium: 1 cycle; Pentium Pro – Pentium III: 2 cycles; Pentium 4 – Core i7: 4 cycles
● Increases branch misprediction penalty
● Makes it easier to increase associativity

Page 37: Nonblocking Caches

● Allow hits before previous misses complete
  – Hit under miss
  – Hit under multiple misses
● L2 must support this
● In general, processors can hide an L1 miss penalty but not an L2 miss penalty

Page 38: Multibanked Caches

● The cache is divided into independent banks
● Banks can be accessed simultaneously
● Spread the addresses of blocks sequentially across the banks (Sequential Interleaving); see the example below
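For example, with four banks under sequential interleaving, the bank number is the block address modulo 4: blocks 0, 4, 8, … map to bank 0; blocks 1, 5, 9, … to bank 1; and so on.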

Page 39: Critical Word First and Early Restart

● Processor needs only one word at a time
● Critical word first:
  – Request the missed word from memory first
  – Send it to the processor as soon as it arrives
● Early restart:
  – Request words in normal order
  – Send the missed word to the processor as soon as it arrives
● Effectiveness of this strategy depends on block size and the likelihood of another access to the portion of the block that has not yet been fetched

Page 40: Merging Write Buffer

● When storing to a block that is already pending in the write buffer, update the existing write buffer entry
● Reduces stalls due to a full write buffer
● Does not apply to I/O addresses

[Figure: write buffer contents without and with write merging.]
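For example (the situation in H&P's figure): four stores to the four sequential 64-bit words of one block can merge into a single write-buffer entry, one address plus four valid data words, instead of occupying four separate entries.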

Page 41: Compiler Optimizations

● Loop interchange
  – Swap nested loops to access memory in sequential order
● Blocking
  – Instead of accessing entire rows or columns, subdivide matrices into blocks
  – Requires more memory accesses but improves locality of the accesses

Page 42: Storage Order of Arrays

● Storage order of 2D matrices in memory
  – Compiler's decision
● Row major order
  – Elements stored row-wise
  – C, C++
● Column major order
  – Elements stored column-wise
  – FORTRAN

Page 43: 2D Matrix Sum

double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
  for (i = 0; i < 1024; i++)
    B[i][j] = A[i][j] + B[i][j];

Load A[0][0] Load B[0][0] Store B[0][0]

Load A[1][0] Load B[1][0] Store B[1][0]

.... .... ....

Load A[0][1] Load B[0][1] Store B[0][1]

Load A[0][2] Load B[0][2] Store B[0][2]

... .... ....

● Load A[0][0] is a cold start miss

● A cache block contains A[0][0] → A[0][3]

● The reference order is different from the storage order!

● Every iteration: cold miss (load A), cold miss (load B), hit (store B hits the block just loaded)

● Hit ratio: 1 hit in every 3 accesses ≈ 33%

Page 44: Loop Interchange

double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
  for (i = 0; i < 1024; i++)
    B[i][j] = A[i][j] + B[i][j];

double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    B[i][j] = A[i][j] + B[i][j];

Page 45: Loop Interchange

● A cache block contains A[0][0] → A[0][3]
● Hit ratio: 83.3% (per block of 4 elements: 12 accesses, 2 misses, 10 hits → 10/12)

double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
  for (j = 0; j < 1024; j++)
    B[i][j] = A[i][j] + B[i][j];

Load A[0][0] Load B[0][0] Store B[0][0]
Load A[0][1] Load B[0][1] Store B[0][1]
.... .... ....
Load A[1][0] Load B[1][0] Store B[1][0]
Load A[1][1] Load B[1][1] Store B[1][1]
... .... ....

Page 46: Matrix Multiplication

double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];

Y[0,0], Z[0,0], Y[0,1], Z[1,0], Y[0,2], Z[2,0], .... X[0,0]
Y[0,0], Z[0,1], Y[0,1], Z[1,1], Y[0,2], Z[2,1], .... X[0,1]
....
Y[1,0], Z[0,0], Y[1,1], Z[1,0], Y[1,2], Z[2,0], .... X[1,0]

Page 47: Blocking

● Make full use of the elements of Z while they are in the cache

[Figure: X = X + Y × Z with 2×2 submatrices (0,0), (0,1), (1,0), (1,1) of Y and Z:]

X[0,0] = Y[0,0] × Z[0,0] + Y[0,1] × Z[1,0]
X[1,0] = Y[1,0] × Z[0,0] + Y[1,1] × Z[1,0]

Page 48: Blocking

Original:

double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j];

Blocked, with blocking factor B:

double X[N][N], Y[N][N], Z[N][N];
for (J = 0; J < N; J += B)
  for (K = 0; K < N; K += B)
    for (i = 0; i < N; i++)
      for (j = J; j < min(J+B, N); j++) {
        r = 0;
        for (k = K; k < min(K+B, N); k++)
          r += Y[i][k] * Z[k][j];
        X[i][j] += r;
      }
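The blocking factor B is chosen so that the data being reused, roughly a B × B submatrix of Z plus a row slice of Y, fits in the cache; the inner loops then hit in the cache on Z instead of missing on every column access.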

Page 49: Hardware Prefetching

● Fetch two blocks on a miss (the requested block and the next sequential block)

[Figure: performance gains from Pentium 4 pre-fetching.]

Page 50: Compiler Prefetching

● Insert prefetch instructions before the data is needed
● Non-faulting: prefetch doesn't cause exceptions
● Register prefetch
  – Loads data into a register
● Cache prefetch
  – Loads data into the cache
● Combine with loop unrolling and software pipelining

Page 51: Summary

Page 52: Memory Technology

● Main memory serves as input and output to I/O interfaces and the processor
● DRAMs for main memory, SRAM for caches
● Metrics: latency, bandwidth
● Access Time
  – Time between a read request and when the desired word arrives
● Cycle Time
  – Minimum time between unrelated requests to memory

Page 53: Memory Technology

● SRAM
  – Requires low power to retain bits; 6 transistors/bit
● DRAM
  – Must be re-written after being read
  – Must be periodically refreshed (every ~8 ms)
  – Address lines are multiplexed (a sketch follows):
    ● Upper half of address: row access strobe (RAS)
    ● Lower half of address: column access strobe (CAS)
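A sketch in C of the multiplexed addressing (the bit widths are assumptions for illustration: a hypothetical device with a square 4096 × 4096 array, so 12 row bits and 12 column bits):

#include <stdio.h>
#include <stdint.h>

#define ROW_BITS 12
#define COL_BITS 12

int main(void) {
    /* The same address pins carry the row half on RAS, then the column
     * half on CAS, halving the pin count at the cost of two transfers. */
    uint32_t addr = 0x00ABCDEF & ((1u << (ROW_BITS + COL_BITS)) - 1);
    uint32_t row = addr >> COL_BITS;              /* upper half: sent with RAS */
    uint32_t col = addr & ((1u << COL_BITS) - 1); /* lower half: sent with CAS */
    printf("addr=0x%06X -> row=0x%03X (RAS), col=0x%03X (CAS)\n", addr, row, col);
    return 0;
}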

Page 54: Memory Technology - Optimizations

● Multiple accesses to the same row
● Synchronous DRAM
  – Clocked operation, burst mode
● Wider interfaces
● Double data rate
● Multiple banks on each DRAM device

Page 55: Clock Rates, Bandwidth and Names

[Table: DDR generations and supply voltages]

DDR:  2.5 V
DDR2: 1.8 V
DDR3: 1.5 V
DDR4: 1 - 1.2 V

● GDDR5 – graphics memory based on DDR3
  – 2x – 5x bandwidth per DRAM vs. DDR3
  – Wider interface, higher clock rate
  – Attached via soldering instead of socketed DIMM

Page 56: Flash Memory

● Type of EEPROM
● Must be erased (in blocks) before being overwritten
● Non-volatile
● Limited number of write cycles
● Cheaper than SDRAM, more expensive than disk
● Slower than SDRAM, faster than disk

Page 57: Memory Dependability

● Memory is susceptible to cosmic rays
  – Soft errors (transient)
● Hard errors (permanent)
  – ECC can detect and correct soft errors
  – Spare rows are used as replacements in case a row fails
● Chipkill: a RAID-like error recovery technique