Digital Design & Computer Arch. Lecture 21a: Memory Organization and Memory Technology Prof. Onur Mutlu ETH Zürich Spring 2020 14 May 2020


Page 1

Digital Design & Computer Arch.
Lecture 21a: Memory Organization and Memory Technology

Prof. Onur Mutlu
ETH Zürich, Spring 2020, 14 May 2020

Page 2

Readings for This Lecture and Next: Memory Hierarchy and Caches

- Required
  - H&H Chapters 8.1-8.3
  - Refresh: P&P Chapter 3.5
- Recommended
  - An early cache paper by Maurice Wilkes
  - Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.

2

Page 3

A Computing System

- Three key components:
  - Computation
  - Communication
  - Storage/memory

3

Burks, Goldstein, von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument," 1946.

Image source: https://lbsitbytes2010.wordpress.com/2013/03/29/john-von-neumann-roll-no-15/

Page 4

What is A Computer?

- We will cover all three components

4

[Figure: von Neumann model: Memory (program and data), I/O, and Processing (control/sequencing + datapath)]

Page 5

Memory (Programmer’s View)

5

Page 6

Abstraction: Virtual vs. Physical Memory

- Programmer sees virtual memory
  - Can assume the memory is "infinite"
- Reality: Physical memory size is much smaller than what the programmer assumes
- The system (system software + hardware, cooperatively) maps virtual memory addresses to physical memory
  - The system automatically manages the physical memory space, transparently to the programmer

+ Programmer does not need to know the physical size of memory nor manage it → A small physical memory can appear as a huge one to the programmer → Life is easier for the programmer
-- More complex system software and architecture

A classic example of the programmer/(micro)architect tradeoff

Page 7

(Physical) Memory System

- You need a larger level of storage to manage a small amount of physical memory automatically
  → Physical memory has a backing store: disk
- We will first start with the physical memory system
- For now, ignore the virtual→physical indirection
- We will get back to it later, if time permits…

7

Page 8

Idealism

8

[Figure: Instruction Supply → Pipeline (instruction execution) → Data Supply]

- Instruction Supply: zero-latency access, infinite capacity, zero cost, perfect control flow
- Pipeline: no pipeline stalls, perfect data flow (reg/memory dependencies), zero-cycle interconnect (operand communication), enough functional units, zero-latency compute
- Data Supply: zero-latency access, infinite capacity, infinite bandwidth, zero cost

Page 9

Quick Overview of Memory Arrays

Page 10

How Can We Store Data?

- Flip-Flops (or Latches)
  - Very fast, parallel access
  - Very expensive (one bit costs tens of transistors)
- Static RAM (we will describe them in a moment)
  - Relatively fast, only one data word at a time
  - Expensive (one bit costs 6 transistors)
- Dynamic RAM (we will describe them a bit later)
  - Slower, one data word at a time, reading destroys content (refresh), needs special process for manufacturing
  - Cheap (one bit costs only one transistor plus one capacitor)
- Other storage technology (flash memory, hard disk, tape)
  - Much slower, access takes a long time, non-volatile
  - Very cheap (no transistors directly involved)

Page 11

Array Organization of Memories

- Goal: Efficiently store large amounts of data
  - A memory array (stores data)
  - Address selection logic (selects one row of the array)
  - Readout circuitry (reads data out)
- An M-bit value can be read or written at each unique N-bit address
  - All values can be accessed, but only M bits at a time
  - Access restriction allows a more compact organization

[Figure: Array with N-bit Address input and M-bit Data output]

Page 12

Memory Arrays

- Two-dimensional array of bit cells
  - Each bit cell stores one bit
- An array with N address bits and M data bits:
  - 2^N rows and M columns
  - Depth: number of rows (number of words)
  - Width: number of columns (size of word)
  - Array size: depth × width = 2^N × M

[Figure: generic N-address-bit, M-data-bit array, and a 2-address-bit, 3-data-bit example with its address/data table]

Page 13

Memory Array Example

- 2^2 × 3-bit array
- Number of words: 4
- Word size: 3 bits
- For example, the 3-bit word stored at address 10 is 100

Address  Data
  11     010
  10     100
  01     110
  00     011
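The example array above can be modeled in a few lines of Python (a sketch, not from the slides; the list contents come from the example's address/data table):

```python
# Model the 4-word x 3-bit memory array as a list indexed by address.
# Depth = 2^2 words, width = 3 bits; address 0b10 holds 0b100.
array = [0b011, 0b110, 0b100, 0b010]  # contents at addresses 00, 01, 10, 11

def read(address: int) -> str:
    """Return the 3-bit word stored at a 2-bit address."""
    return format(array[address], "03b")

print(read(0b10))  # -> 100
```

Indexing the list plays the role of the address selection logic: one address selects exactly one word.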

Page 14

Larger and Wider Memory Array Example

[Figure: 1024-word × 32-bit array; 10 address bits in, 32 data bits out]

Page 15

Memory Array Organization (I)

- Storage nodes in one column connected to one bitline
- Address decoder activates only ONE wordline
- Content of one line of storage available at output

[Figure: 2-bit Address into a 2:4 decoder driving wordline3..wordline0; the stored bits of the selected row drive bitline2..bitline0, producing Data2..Data0]

Page 16

Memory Array Organization (II)

- Storage nodes in one column connected to one bitline
- Address decoder activates only ONE wordline
- Content of one line of storage available at output

[Figure: same array with Address = 10; wordline2 is the active wordline, and the stored row 1 0 0 appears on Data2 Data1 Data0]
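The decoder's behavior (exactly one active wordline per address) can be sketched in Python; this is an illustrative model, not slide material:

```python
def decode(address: int, n_bits: int = 2) -> list[int]:
    """2^n-output address decoder: drives exactly one wordline high.

    Returned list is ordered wordline0 .. wordline(2^n - 1).
    """
    wordlines = [0] * (1 << n_bits)
    wordlines[address] = 1  # one-hot: only the addressed row is selected
    return wordlines

# Address 0b10 activates wordline2, matching the figure above.
print(decode(0b10))  # -> [0, 0, 1, 0]
```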

Page 17

How is Access Controlled?

- Access transistors configured as switches connect the bit storage to the bitline
- Access controlled by the wordline

[Figure: DRAM cell (one access transistor onto one bitline) vs. SRAM cell (two access transistors onto bitline and _bitline)]

Page 18

Building Larger Memories

- Requires larger memory arrays
- Large → slow
- How do we make the memory large without making it very slow?
- Idea: Divide the memory into smaller arrays and interconnect the arrays to input/output buses
  - Large memories are hierarchical array structures
  - DRAM: Channel → Rank → Bank → Subarrays → Mats

18

Page 19

General Principle: Interleaving (Banking)

- Problem: a single monolithic large memory array takes long to access and does not enable multiple accesses in parallel
- Goal: Reduce the latency of memory array access and enable multiple accesses in parallel
- Idea: Divide a large array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
  - Each bank is smaller than the entire memory storage
  - Accesses to different banks can be overlapped
- A key issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)

19
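One common answer to the mapping question is low-order interleaving; a minimal Python sketch (one possible mapping, not the only one, and not from the slides):

```python
def bank_of(block_address: int, n_banks: int) -> int:
    """Low-order interleaving: consecutive blocks map to consecutive banks,
    so a streaming access pattern spreads its accesses across all banks
    and they can be overlapped."""
    return block_address % n_banks

# Blocks 0..7 spread round-robin across 4 banks.
print([bank_of(b, 4) for b in range(8)])  # -> [0, 1, 2, 3, 0, 1, 2, 3]
```

With this mapping, a sequential sweep keeps every bank busy; a stride equal to the number of banks would instead hammer a single bank.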

Page 20

Memory Technology: DRAM and SRAM

Page 21

Memory Technology: DRAM

- Dynamic random access memory
- Capacitor charge state indicates stored value
  - Whether the capacitor is charged or discharged indicates storage of 1 or 0
  - 1 capacitor
  - 1 access transistor
- Capacitor leaks through the RC path
  - DRAM cell loses charge over time
  - DRAM cell needs to be refreshed

[Figure: DRAM cell; row enable gates an access transistor onto the _bitline]

Page 22

Memory Technology: SRAM

- Static random access memory
- Two cross-coupled inverters store a single bit
  - Feedback path enables the stored value to persist in the "cell"
  - 4 transistors for storage
  - 2 transistors for access

[Figure: SRAM cell; row select gates two access transistors onto bitline and _bitline]

Page 23

Memory Bank Organization and Operation

- Read access sequence:
  1. Decode row address & drive wordlines
  2. Selected bits drive bitlines
     - Entire row read
  3. Amplify row data
  4. Decode column address & select subset of row
     - Send to output
  5. Precharge bitlines
     - For next access

23

Page 24

SRAM (Static Random Access Memory)

[Figure: 2^n-row × 2^m-column bit-cell array (n ≈ m to minimize overall latency); the n+m address bits split into n row-select bits and m column-select bits; sense amp and mux over 2^m differential pairs produce 1 data bit]

Read Sequence
1. address decode
2. drive row select
3. selected bit-cells drive bitlines (entire row is read together)
4. differential sensing and column select (data is ready)
5. precharge all bitlines (for next read or write)

- Access latency dominated by steps 2 and 3
- Cycle time dominated by steps 2, 3 and 5
  - step 2 proportional to 2^m
  - steps 3 and 5 proportional to 2^n

Page 25

DRAM (Dynamic Random Access Memory)

[Figure: 2^n-row × 2^m-column bit-cell array (n ≈ m to minimize overall latency), addressed by RAS (row) and CAS (column); sense amp and mux produce 1 data bit. A DRAM die comprises multiple such arrays.]

- Bits stored as charge on node capacitance (non-restorative)
  - bit cell loses charge when read
  - bit cell loses charge over time

Read Sequence
1~3. same as SRAM
4. a "flip-flopping" sense amp amplifies and regenerates the bitline; the data bit is mux'ed out
5. precharge all bitlines

- Destructive reads
- Charge loss over time
- Refresh: A DRAM controller must periodically read each row within the allowed refresh time (10s of ms) so that charge is restored

Page 26

DRAM vs. SRAM

- DRAM
  - Slower access (capacitor)
  - Higher density (1T-1C cell)
  - Lower cost
  - Requires refresh (power, performance, circuitry)
  - Manufacturing requires putting capacitor and logic together
- SRAM
  - Faster access (no capacitor)
  - Lower density (6T cell)
  - Higher cost
  - No need for refresh
  - Manufacturing compatible with logic process (no capacitor)

26

Page 27

Digital Design & Computer Arch.
Lecture 21a: Memory Organization and Memory Technology

Prof. Onur Mutlu
ETH Zürich, Spring 2020, 14 May 2020

Page 28

We did not cover the following slides in lecture. These are for your preparation for the next lecture.

Page 29

The Memory Hierarchy

Page 30

Memory in a Modern System

30

[Figure: die photo of a four-core chip, labeled CORE 0-3, per-core L2 CACHE 0-3, SHARED L3 CACHE, DRAM INTERFACE, DRAM BANKS, and DRAM MEMORY CONTROLLER]

Page 31

Ideal Memory

- Zero access time (latency)
- Infinite capacity
- Zero cost
- Infinite bandwidth (to support multiple accesses in parallel)

31

Page 32

The Problem

- Ideal memory's requirements oppose each other
- Bigger is slower
  - Bigger → Takes longer to determine the location
- Faster is more expensive
  - Memory technology: SRAM vs. DRAM vs. Disk vs. Tape
- Higher bandwidth is more expensive
  - Need more banks, more ports, higher frequency, or faster technology

32

Page 33

The Problem

- Bigger is slower
  - SRAM, 512 Bytes, sub-nanosec
  - SRAM, KByte~MByte, ~nanosec
  - DRAM, Gigabyte, ~50 nanosec
  - Hard Disk, Terabyte, ~10 millisec
- Faster is more expensive (dollars and chip area)
  - SRAM, < $10 per Megabyte
  - DRAM, < $1 per Megabyte
  - Hard Disk, < $1 per Gigabyte
  - These sample values (circa ~2011) scale with time
- Other technologies have their place as well
  - Flash memory (mature), PC-RAM, MRAM, RRAM (not mature yet)

33

Page 34

Why Memory Hierarchy?

- We want both fast and large
- But we cannot achieve both with a single level of memory
- Idea: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)

34

Page 35

The Memory Hierarchy

35

[Figure: a fast, small level near the processor ("move what you use here") backed by a big but slow level ("backup everything here"); levels closer to the processor are faster per byte, levels farther away are cheaper per byte]

With good locality of reference, memory appears as fast as the fast level and as large as the slow level.

Page 36

Memory Hierarchy

- Fundamental tradeoff
  - Fast memory: small
  - Large memory: slow
- Idea: Memory hierarchy
- Latency, cost, size, bandwidth

[Figure: CPU with register file (RF) → Cache → Main Memory (DRAM) → Hard Disk]

Page 37

Locality

- One's recent past is a very good predictor of his/her near future.
- Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon
  - since you are here today, there is a good chance you will be here again and again regularly
- Spatial Locality: If you did something, it is very likely you will do something similar/related (in space)
  - every time I find you in this room, you are probably sitting close to the same people

37

Page 38

Memory Locality

- A "typical" program has a lot of locality in memory references
  - typical programs are composed of "loops"
- Temporal: A program tends to reference the same memory location many times, all within a small window of time
- Spatial: A program tends to reference a cluster of memory locations at a time
  - most notable examples:
    1. instruction memory references
    2. array/data structure references

38

Page 39

Caching Basics: Exploit Temporal Locality

- Idea: Store recently accessed data in automatically managed fast memory (called cache)
- Anticipation: the data will be accessed again soon
- Temporal locality principle
  - Recently accessed data will be accessed again in the near future
  - This is what Maurice Wilkes had in mind:
    - Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
    - "The use is discussed of a fast core memory of, say, 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory."

Page 40

Caching Basics: Exploit Spatial Locality

- Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory
  - Logically divide memory into equal-size blocks
  - Fetch to cache the accessed block in its entirety
- Anticipation: nearby data will be accessed soon
- Spatial locality principle
  - Nearby data in memory will be accessed in the near future
    - E.g., sequential instruction access, array traversal
  - This is what the IBM 360/85 implemented
    - 16 Kbyte cache with 64-byte blocks
    - Liptay, "Structural aspects of the System/360 Model 85 II: the cache," IBM Systems Journal, 1968.

40

Page 41

The Bookshelf Analogy

- Book in your hand
- Desk
- Bookshelf
- Boxes at home
- Boxes in storage

- Recently-used books tend to stay on desk
  - Comp Arch books, books for classes you are currently taking
  - Until the desk gets full
- Adjacent books in the shelf needed around the same time
  - If I have organized/categorized my books well in the shelf

41

Page 42

Caching in a Pipelined Design

- The cache needs to be tightly integrated into the pipeline
  - Ideally, access in 1 cycle so that load-dependent operations do not stall
- High-frequency pipeline → Cannot make the cache large
  - But, we want a large cache AND a pipelined design
- Idea: Cache hierarchy

[Figure: CPU with register file (RF) → Level 1 Cache → Level 2 Cache → Main Memory (DRAM)]

Page 43

A Note on Manual vs. Automatic Management

- Manual: Programmer manages data movement across levels
  -- too painful for programmers on substantial programs
  - "core" vs. "drum" memory in the 50's
  - still done in some embedded processors (on-chip scratch-pad SRAM in lieu of a cache) and GPUs (called "shared memory")
- Automatic: Hardware manages data movement across levels, transparently to the programmer
  ++ programmer's life is easier
  - the average programmer doesn't need to know about it
    - You don't need to know how big the cache is and how it works to write a "correct" program! (What if you want a "fast" program?)

43

Page 44

Automatic Management in Memory Hierarchy

- Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
- "By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again."

44

Page 45

Historical Aside: Other Cache Papers

- Fotheringham, "Dynamic Storage Allocation in the Atlas Computer, Including an Automatic Use of a Backing Store," CACM 1961.
  - http://dl.acm.org/citation.cfm?id=366800
- Bloom, Cohen, Porter, "Considerations in the Design of a Computer with High Logic-to-Memory Speed Ratio," AIEE Gigacycle Computing Systems Winter Meeting, Jan. 1962.

45

Page 46

A Modern Memory Hierarchy

46

[Figure: hierarchy levels and how each is managed]

- Register File: 32 words, sub-nsec (manual/compiler register spilling)
- L1 cache: ~32 KB, ~nsec (automatic HW cache management)
- L2 cache: 512 KB ~ 1 MB, many nsec
- L3 cache, ...
- Main memory (DRAM): GB, ~100 nsec (memory abstraction; automatic demand paging below this level)
- Swap Disk: 100 GB, ~10 msec

Page 47

Hierarchical Latency Analysis

- A given memory hierarchy level i has a technology-intrinsic access time of t_i. The perceived access time T_i is longer than t_i.
- Except for the outermost level, when looking for a given address there is
  - a chance (hit-rate h_i) you "hit" and the access time is t_i
  - a chance (miss-rate m_i) you "miss" and the access time is t_i + T_{i+1}
  - h_i + m_i = 1
- Thus
  T_i = h_i·t_i + m_i·(t_i + T_{i+1})
  T_i = t_i + m_i·T_{i+1}
- h_i and m_i are defined to be the hit-rate and miss-rate of just the references that missed at level i-1

47
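The recursion above unrolls from the outermost level inward; a minimal Python sketch (not from the slides), using the intrinsic latencies of the Pentium 4 example on a later slide:

```python
def perceived_latency(t, m):
    """Evaluate T_i = t_i + m_i * T_(i+1) for level 1.

    t: intrinsic access times [t1, ..., tn]; the outermost level
       always hits, so T_n = t_n.
    m: miss rates [m1, ..., m(n-1)].
    """
    T = t[-1]  # outermost level
    for ti, mi in zip(reversed(t[:-1]), reversed(m)):
        T = ti + mi * T  # fold one level inward
    return T

# t1=4, t2=18, t3=180 cycles; m1=m2=0.01 gives T1 ~= 4.2,
# matching the Pentium 4 example.
print(perceived_latency([4, 18, 180], [0.01, 0.01]))
```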

Page 48

Hierarchy Design Considerations

- Recursive latency equation: T_i = t_i + m_i·T_{i+1}
- The goal: achieve desired T_1 within allowed cost
- T_i ≈ t_i is desirable
- Keep m_i low
  - increasing capacity C_i lowers m_i, but beware of increasing t_i
  - lower m_i by smarter management (replacement: anticipate what you don't need; prefetching: anticipate what you will need)
- Keep T_{i+1} low
  - faster lower hierarchies, but beware of increasing cost
  - introduce intermediate hierarchies as a compromise

48

Page 49

Intel Pentium 4 Example

- 90nm P4, 3.6 GHz
- L1 D-cache
  - C1 = 16 KB
  - t1 = 4 cyc int / 9 cyc fp
- L2 D-cache
  - C2 = 1024 KB
  - t2 = 18 cyc int / 18 cyc fp
- Main memory
  - t3 = ~50 ns, or 180 cyc
- Notice
  - best-case latency is not 1
  - worst-case access latencies are into 500+ cycles

- if m1 = 0.1, m2 = 0.1: T1 = 7.6, T2 = 36
- if m1 = 0.01, m2 = 0.01: T1 = 4.2, T2 = 19.8
- if m1 = 0.05, m2 = 0.01: T1 = 5.00, T2 = 19.8
- if m1 = 0.01, m2 = 0.50: T1 = 5.08, T2 = 108

Page 50

Cache Basics and Operation

Page 51

Cache

- Generically, any structure that "memoizes" frequently used results to avoid repeating the long-latency operations required to reproduce the results from scratch, e.g., a web cache
- Most commonly in the on-die context: an automatically-managed memory hierarchy based on SRAM
  - memoize in SRAM the most frequently accessed DRAM memory locations to avoid repeatedly paying for the DRAM access latency

51

Page 52

Caching Basics

- Block (line): Unit of storage in the cache
  - Memory is logically divided into cache blocks that map to locations in the cache
- On a reference:
  - HIT: If in cache, use cached data instead of accessing memory
  - MISS: If not in cache, bring block into cache
    - May have to kick something else out to do it
- Some important cache design decisions
  - Placement: where and how to place/find a block in cache?
  - Replacement: what data to remove to make room in cache?
  - Granularity of management: large or small blocks? Subblocks?
  - Write policy: what do we do about writes?
  - Instructions/data: do we treat them separately?

52

Page 53

Cache Abstraction and Metrics

- Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
- Average memory access time (AMAT) = (hit-rate × hit-latency) + (miss-rate × miss-latency)
- Aside: Can reducing AMAT reduce performance?

[Figure: Address feeds a Tag Store ("is the address in the cache?" + bookkeeping) and a Data Store (stores memory blocks), producing Hit/miss? and Data]
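The AMAT formula above is directly computable; a tiny Python sketch (illustrative numbers, not from the slides):

```python
def amat(hit_rate: float, hit_latency: float,
         miss_rate: float, miss_latency: float) -> float:
    """AMAT = hit-rate * hit-latency + miss-rate * miss-latency."""
    return hit_rate * hit_latency + miss_rate * miss_latency

# 90% hits at 2 cycles, 10% misses costing 100 cycles: about 11.8 cycles.
print(amat(0.9, 2, 0.1, 100))
```

Note how the miss latency dominates even at a 90% hit rate, which is why reducing the miss rate (or the miss latency) matters so much.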

Page 54

A Basic Hardware Cache Design

- We will start with a basic hardware cache design
- Then, we will examine a multitude of ideas to make it better

54

Page 55

Blocks and Addressing the Cache

- Memory is logically divided into fixed-size blocks
- Each block maps to a location in the cache, determined by the index bits in the address
  - used to index into the tag and data stores
- Cache access:
  1. index into the tag and data stores with index bits in address
  2. check valid bit in tag store
  3. compare tag bits in address with the stored tag in tag store
- If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block

8-bit address: tag (2 bits) | index (3 bits) | byte in block (3 bits)
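The address split can be sketched in Python for the 8-bit case above (a sketch, not slide material; field widths follow the 8-byte-block, 8-set cache of the next slide):

```python
TAG_BITS, INDEX_BITS, OFFSET_BITS = 2, 3, 3  # 8-bit address

def split_address(addr: int):
    """Split an 8-bit address into (tag, index, byte-in-block)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)           # low 3 bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)           # remaining high bits
    return tag, index, offset

print(split_address(0b10100110))  # -> (2, 4, 6)
```

The index selects the tag-store/data-store entry; the tag disambiguates which of the memory blocks sharing that index is actually cached.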

Page 56

Direct-Mapped Cache: Placement and Access

- Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks
- Assume cache: 64 bytes, 8 blocks
  - Direct-mapped: A block can go to only one location
  - Addresses with same index contend for the same location
    - Cause conflict misses

[Figure: index bits select one valid bit + tag entry in the tag store and one block in the data store; an =? comparator on the tag and a byte-in-block MUX produce Hit? and Data]

Page 57

Direct-Mapped Caches

- Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
  - One index → one entry
- Can lead to 0% hit rate if more than one block accessed in an interleaved manner map to the same index
  - Assume addresses A and B have the same index bits but different tag bits
  - A, B, A, B, A, B, A, B, … → conflict in the cache index
  - All accesses are conflict misses

57
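The pathological A, B, A, B pattern is easy to reproduce with a toy simulator (a sketch under the 8-byte-block, 8-set assumptions of the earlier slide, not slide material):

```python
def simulate_direct_mapped(accesses, n_sets=8, block_bytes=8):
    """Count hits in a direct-mapped cache; only (index -> tag) is tracked."""
    tags = {}
    hits = 0
    for addr in accesses:
        block = addr // block_bytes
        index, tag = block % n_sets, block // n_sets
        if tags.get(index) == tag:
            hits += 1
        else:
            tags[index] = tag  # miss: evict whatever occupied this index
    return hits

# A = 0x00 and B = 0x40 share index 0 but have different tags,
# so the alternating pattern never hits.
print(simulate_direct_mapped([0x00, 0x40] * 4))  # -> 0
```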

Page 58

Set Associativity

- Addresses 0 and 8 always conflict in direct mapped cache
- Instead of having one column of 8 blocks, have 2 columns of 4 blocks

Address: tag (3 bits) | index (2 bits) | byte in block (3 bits)

[Figure: tag store with two ways per set (two valid + tag entries, two =? comparators feeding hit logic) and data store with a way MUX followed by the byte-in-block MUX, producing Hit? and the data]

Key idea: Associative memory within the set
+ Accommodates conflicts better (fewer conflict misses)
-- More complex, slower access, larger tag store
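Extending the toy simulator to two ways per set shows the payoff (a sketch, not slide material; 4 sets per the "2 columns of 4 blocks" organization, LRU replacement assumed):

```python
from collections import OrderedDict

def simulate_two_way(accesses, n_sets=4, block_bytes=8):
    """Count hits in a 2-way set-associative cache with per-set LRU."""
    sets = {s: OrderedDict() for s in range(n_sets)}
    hits = 0
    for addr in accesses:
        block = addr // block_bytes
        index, tag = block % n_sets, block // n_sets
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.move_to_end(tag)         # promote on hit (LRU -> MRU)
        else:
            if len(ways) == 2:
                ways.popitem(last=False)  # evict the least-recently-used way
            ways[tag] = True
    return hits

# The same A, B, A, B pattern that always missed in the direct-mapped
# cache now hits on every access after the two cold misses.
print(simulate_two_way([0x00, 0x40] * 4))  # -> 6
```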

Page 59

Higher Associativity
- 4-way
  + Likelihood of conflict misses even lower
  -- More tag comparators and wider data MUX; larger tags


[Figure: 4-way set-associative cache. Four (V, tag) comparators (=?) per set feed the hit logic; MUXes select the hitting way and the byte in block]

Page 60

Full Associativity
- Fully associative cache
  - A block can be placed in any cache location


[Figure: fully associative cache. All eight (V, tag) entries are compared (=?) in parallel; hit logic and MUXes select the matching block and the byte in block]

Page 61

Associativity (and Tradeoffs)
- Degree of associativity: how many blocks can map to the same index (or set)?
- Higher associativity
  ++ Higher hit rate
  -- Slower cache access time (hit latency and data access latency)
  -- More expensive hardware (more comparators)
- Diminishing returns from higher associativity

[Figure: hit rate vs. associativity, showing diminishing returns]
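The diminishing-returns curve can be reproduced with a toy experiment (illustrative trace and parameters, not from the slides): sweep the number of ways in a single LRU-managed set over a cyclic 3-block trace. The hit count jumps once the set can hold the working set, then stops improving.

```python
# LRU hit count for one set, swept over associativity (illustrative sketch).
def lru_hits(trace, ways):
    stack = []                    # resident blocks, MRU at the end
    hits = 0
    for block in trace:
        if block in stack:
            hits += 1
            stack.remove(block)   # promote to MRU (re-appended below)
        elif len(stack) == ways:
            stack.pop(0)          # evict the LRU block
        stack.append(block)
    return hits

trace = [0, 1, 2] * 3
for ways in (1, 2, 3, 4):
    print(ways, lru_hits(trace, ways))   # hits: 0, 0, 6, 6 -> returns diminish
```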

Page 62

Issues in Set-Associative Caches
- Think of each block in a set as having a "priority"
  - Indicating how important it is to keep the block in the cache
- Key issue: How do you determine/adjust block priorities?
- There are three key decisions in a set:
  - Insertion, promotion, eviction (replacement)
- Insertion: What happens to priorities on a cache fill?
  - Where to insert the incoming block, whether or not to insert the block
- Promotion: What happens to priorities on a cache hit?
  - Whether and how to change block priority
- Eviction/replacement: What happens to priorities on a cache miss?
  - Which block to evict and how to adjust priorities


Page 63

Eviction/Replacement Policy
- Which block in the set to replace on a cache miss?
  - Any invalid block first
  - If all are valid, consult the replacement policy
    - Random
    - FIFO
    - Least recently used (how to implement?)
    - Not most recently used
    - Least frequently used?
    - Least costly to re-fetch?
      - Why would memory accesses have different cost?
    - Hybrid replacement policies
    - Optimal replacement policy?


Page 64

Implementing LRU
- Idea: evict the least recently accessed block
- Problem: need to keep track of the access ordering of blocks
- Question: 2-way set-associative cache:
  - What do you need to implement LRU perfectly?
- Question: 4-way set-associative cache:
  - What do you need to implement LRU perfectly?
  - How many different orderings are possible for the 4 blocks in the set?
  - How many bits are needed to encode the LRU order of a block?
  - What is the logic needed to determine the LRU victim?

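One possible answer sketch for the 4-way questions (an illustration, not the slides' official solution): keep a 2-bit "rank" per block, 0 = MRU through 3 = LRU. There are 4! = 24 possible orderings, so a full set ordering needs ⌈log2 24⌉ = 5 bits, while per-block ranks use 4 × 2 = 8 bits.

```python
# Perfect LRU via per-block 2-bit ranks (0 = MRU ... 3 = LRU), a sketch.
def touch(ranks, way):
    """Update ranks on an access to `way`."""
    old = ranks[way]
    for w in range(len(ranks)):
        if ranks[w] < old:
            ranks[w] += 1          # blocks more recent than `way` age by one
    ranks[way] = 0                 # the accessed block becomes MRU
    return ranks

def lru_victim(ranks):
    return ranks.index(max(ranks)) # the highest-rank block is the LRU victim

ranks = [0, 1, 2, 3]               # way 0 is MRU, way 3 is LRU
touch(ranks, 2)                    # access way 2
print(ranks, lru_victim(ranks))    # -> [1, 2, 0, 3] 3
```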

Page 65

Approximations of LRU
- Most modern processors do not implement "true LRU" (also called "perfect LRU") in highly associative caches
- Why?
  - True LRU is complex
  - LRU is an approximation to predict locality anyway (i.e., not the best possible cache management policy)
- Examples:
  - Not MRU (not most recently used)
  - Hierarchical LRU: divide the N-way set into M "groups", track the MRU group and the MRU way in each group
  - Victim-NextVictim Replacement: only keep track of the victim and the next victim

Page 66

Hierarchical LRU (not MRU)
- Divide a set into multiple groups
- Keep track of only the MRU group
- Keep track of only the MRU block in each group
- On replacement, select the victim as:
  - A not-MRU block in one of the not-MRU groups (randomly pick one of such blocks/groups)

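For the 16-way, 2×8-way configuration discussed next, the bookkeeping can be sketched as below (an illustrative sketch, not the slides' exact design): one MRU-group bit plus a 3-bit MRU-way pointer per group, i.e., 7 bits per set versus the ⌈log2 16!⌉ = 45 bits true LRU would need.

```python
import random

# Hierarchical LRU for a 16-way set split into 2 groups of 8 (a sketch).
GROUPS, WAYS_PER_GROUP = 2, 8

class HierLRU:
    def __init__(self):
        self.mru_group = 0
        self.mru_way = [0] * GROUPS          # MRU way within each group

    def on_access(self, way):
        group = way // WAYS_PER_GROUP
        self.mru_group = group               # this group becomes MRU
        self.mru_way[group] = way % WAYS_PER_GROUP

    def victim(self):
        group = 1 - self.mru_group           # the not-MRU group
        # any not-MRU way in that group is a legal victim; pick one at random
        ways = [w for w in range(WAYS_PER_GROUP) if w != self.mru_way[group]]
        return group * WAYS_PER_GROUP + random.choice(ways)

policy = HierLRU()
policy.on_access(3)      # way 3 -> group 0 becomes MRU
print(policy.victim())   # some not-MRU way in group 1 (between 9 and 15)
```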

Page 67

Hierarchical LRU (not MRU): Questions
- 16-way cache
- 2 8-way groups
- What is an access pattern that performs worse than true LRU?
- What is an access pattern that performs better than true LRU?


Page 68

Victim/Next-Victim Policy
- Only 2 blocks' status tracked in each set:
  - Victim (V), Next Victim (NV)
  - All other blocks denoted as (O) – ordinary blocks
- On a cache miss:
  - Replace V
  - Demote NV to V
  - Randomly pick an O block as NV
- On a cache hit to V:
  - Demote NV to V
  - Randomly pick an O block as NV
  - Turn V to O


Page 69

Victim/Next-Victim Policy (II)
- On a cache hit to NV:
  - Randomly pick an O block as NV
  - Turn NV to O
- On a cache hit to O:
  - Do nothing

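The V/NV state transitions from the last two slides can be sketched as a small class (an illustrative sketch; the initial V/NV assignment and the class shape are assumptions, not from the slides). Per set, only two way identifiers are stored.

```python
import random

# Victim/Next-Victim bookkeeping for one set (illustrative sketch).
class VNVSet:
    def __init__(self, ways):
        self.ways = ways
        self.v, self.nv = 0, 1                 # arbitrary initial V and NV

    def _new_nv(self):
        # randomly pick an ordinary (O) block as the new NV
        others = [w for w in range(self.ways) if w not in (self.v, self.nv)]
        return random.choice(others)

    def on_miss(self):
        victim = self.v                        # replace V
        self.v = self.nv                       # demote NV to V
        self.nv = self._new_nv()
        return victim                          # this way receives the fill

    def on_hit(self, way):
        if way == self.v:                      # hit on V: V turns O
            self.v = self.nv                   # demote NV to V
            self.nv = self._new_nv()
        elif way == self.nv:                   # hit on NV: NV turns O
            self.nv = self._new_nv()
        # hit on an O block: do nothing

s = VNVSet(ways=8)
print(s.on_miss())         # -> 0 (the initial victim gets replaced)
```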

Page 70

Victim/Next-Victim Example


Page 71

Cache Replacement Policy: LRU or Random
- LRU vs. Random: which one is better?
  - Example: 4-way cache, cyclic references to A, B, C, D, E
    - 0% hit rate with LRU policy
- Set thrashing: when the "program working set" in a set is larger than the set associativity
  - Random replacement policy is better when thrashing occurs
- In practice:
  - Depends on the workload
  - Average hit rates of LRU and Random are similar
- Best of both worlds: hybrid of LRU and Random
  - How to choose between the two? Set sampling
    - See Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.

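The thrashing example above can be demonstrated directly (an illustrative sketch, not from the slides): a 4-way set under cyclic accesses to five blocks gets 0% hits with LRU, while random replacement retains some blocks by luck.

```python
import random

# One set, 4 ways, LRU vs. random replacement (illustrative sketch).
def run(policy, trace, ways=4, seed=1):
    rng = random.Random(seed)
    blocks = []                      # resident blocks, LRU order (MRU last)
    hits = 0
    for b in trace:
        if b in blocks:
            hits += 1
            blocks.remove(b)
            blocks.append(b)         # promote to MRU
        else:
            if len(blocks) == ways:
                victim = blocks[0] if policy == "lru" else rng.choice(blocks)
                blocks.remove(victim)
            blocks.append(b)
    return hits

trace = list("ABCDE") * 20           # cyclic references to A, B, C, D, E
print(run("lru", trace))             # -> 0 (set thrashing)
print(run("random", trace))          # > 0 with high probability
```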

Page 72

What Is the Optimal Replacement Policy?
- Belady's OPT
  - Replace the block that is going to be referenced furthest in the future by the program
  - Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Systems Journal, 1966.
  - How do we implement this? Simulate?
- Is this optimal for minimizing miss rate?
- Is this optimal for minimizing execution time?
  - No. Cache miss latency/cost varies from block to block!
  - Two reasons: remote vs. local caches, and miss overlapping
  - Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.

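Because OPT needs the future reference stream, it can only be computed offline, e.g., in a simulator over a recorded trace. A minimal sketch (illustrative, not from the slides): on a miss in a full set, evict the resident block whose next use lies furthest in the future. On the thrashing trace from the previous slide, OPT hits where LRU gets 0%.

```python
# Belady's OPT for one fully associative set (illustrative sketch).
def opt_hits(trace, capacity):
    resident, hits = set(), 0
    for i, b in enumerate(trace):
        if b in resident:
            hits += 1
            continue
        if len(resident) == capacity:
            def next_use(x):
                try:
                    return trace.index(x, i + 1)   # next reference after i
                except ValueError:
                    return float("inf")            # never used again
            # evict the block referenced furthest in the future
            resident.remove(max(resident, key=next_use))
        resident.add(b)
    return hits

trace = list("ABCDE") * 3
print(opt_hits(trace, 4))   # -> 8 hits out of 15 accesses (LRU gets 0)
```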

Page 73

Reading
- Key observation: some misses are more costly than others, as their latency is exposed as stall time. Reducing miss rate is not always good for performance. Cache replacement should take the MLP of misses into account.
- Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt, "A Case for MLP-Aware Cache Replacement," Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), pages 167-177, Boston, MA, June 2006. Slides (ppt)
