7/29/2019 Computer Architecture Cache Design
COEN6741 Chap 5.1
11/10/2003
(Dr. Sofine Tahar)
COEN6741
Computer Architecture and Design
Chapter 5
Memory Hierarchy Design
Outline
- Introduction
- Memory Hierarchy
- Cache Memory
- Cache Performance
- Main Memory
- Virtual Memory
- Translation Lookaside Buffer
- Alpha 21064 Example
Computer Architecture Topics
[Figure: memory hierarchy and storage topics - L2 cache, DRAM, disks/WORM/tape; coherence, bandwidth; emerging technologies; interleaving, bus protocols; RAID; input/output and storage]
Who Cares About the Memory Hierarchy?
[Chart: processor vs. DRAM performance over time (log scale, 100-1000) - the processor-DRAM memory gap]
Levels of the Memory Hierarchy
CPU registers: 100s of bytes
A Modern Memory Hierarchy
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Requires servicing faults on the processor (Control + Datapath).

Level                                Speed (ns)                  Size (bytes)
Registers                            1s                          100s
On-chip cache / 2nd-level cache (SRAM)  10s                      Ks
Main memory (DRAM)                   100s                        Ms
Secondary storage (disk)             10,000,000s (10s ms)        Gs
Tertiary storage (disk/tape)         10,000,000,000s (10s sec)   Ts
The Memory Abstraction
Association of address and data
Q3: Which block should be replaced on a miss?
- Easy for direct mapped
- Set associative or fully associative: Random, LRU (Least Recently Used)

Miss rates vs. associativity and replacement policy:
Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB    5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB    1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB   1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
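LRU bookkeeping for one set can be sketched as an MRU-ordered stack. This is a hypothetical illustration (names and structure are mine, not from the slides), assuming a 4-way set with integer tags:

```c
#include <assert.h>

#define WAYS 4

/* One N-way set: tags[0] is the most recently used way,
 * tags[WAYS-1] the least recently used. */
typedef struct {
    int tags[WAYS];   /* -1 marks an empty (invalid) way */
} set_t;

void set_init(set_t *s) {
    for (int i = 0; i < WAYS; i++) s->tags[i] = -1;
}

/* Access `tag`; returns 1 on hit, 0 on miss.  On a miss the LRU way
 * (last slot) is the victim.  Either way, `tag` moves to the front. */
int set_access(set_t *s, int tag) {
    int hit = 0, pos = WAYS - 1;          /* default victim: LRU way */
    for (int i = 0; i < WAYS; i++) {
        if (s->tags[i] == tag) { hit = 1; pos = i; break; }
    }
    for (int i = pos; i > 0; i--)         /* shift MRU stack down */
        s->tags[i] = s->tags[i - 1];
    s->tags[0] = tag;                     /* tag is now MRU */
    return hit;
}
```

Random replacement would simply pick a victim way at random instead of maintaining the stack, which is why its hardware cost is lower and, per the table, its miss rate only slightly higher.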
Q4: What happens on a write?
Write-through: all writes update cache and underlying memory/cache
- Can always discard cached data; the most up-to-date copy is in memory
- Cache control bit: only a valid bit
Write-back: all writes simply update the cache
- Can't just discard cached data; may have to write it back to memory
- Cache control bits: both valid and dirty bits
Other advantages:
- Write-through: memory (or other processors) always have the latest data; simpler management of cache
- Write-back: much lower bandwidth, since data are often overwritten multiple times; better tolerance to long-latency memory

Write Policy: (What happens on a write miss?)
- Write allocate: allocate a new cache line in the cache. Usually means that you have to do a read miss to fill in the rest of the cache line!
Write Buffer for Write-Through
A Write Buffer is needed between the Cache and Memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory
[Figure: Processor and Cache feed a Write Buffer, which drains to DRAM]
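The bandwidth difference between the two policies above can be shown with a toy model. This is a sketch of my own (one cached block, write hits only tracked), not code from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Write-through: every store is sent to memory. */
long write_through(const int *blocks, size_t n) {
    (void)blocks;
    return (long)n;
}

/* Write-back: a store marks the cached block dirty; memory is
 * written only when a different block evicts a dirty one. */
long write_back(const int *blocks, size_t n) {
    int cur = -1, dirty = 0;   /* -1 = invalid */
    long mem_writes = 0;
    for (size_t i = 0; i < n; i++) {
        if (cur != blocks[i]) {            /* miss: evict old block */
            if (dirty) mem_writes++;       /* flush dirty data */
            cur = blocks[i];
        }
        dirty = 1;                         /* write hits stay local */
    }
    if (dirty) mem_writes++;               /* final flush */
    return mem_writes;
}
```

Repeated writes to the same block cost one memory write under write-back but one per store under write-through, which is the "much lower bandwidth" advantage named above.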
Simplest Cache: Direct Mapped
[Figure: 4-byte direct-mapped cache; memory addresses 0-F map to cache indexes 0-3]
- Location 0 can be occupied by data from: memory location 0, 4, 8, ... etc.
- In general: any memory location whose 2 LSBs of the address are 0s
- Address => cache index
- Which one should we place in the cache?
- How can we tell which one is in the cache?
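The address split implied above (low bits select the index, remaining bits form the tag that answers "which one is in the cache") can be sketched generically. A minimal illustration of mine, parameterized by index and offset widths:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t tag, index, offset; } addr_parts_t;

/* Split an address into tag / index / byte offset for a direct-mapped
 * cache with 2^index_bits entries and 2^offset_bits-byte blocks.
 * The slide's 4-entry, 1-byte-block cache is index_bits=2, offset_bits=0. */
addr_parts_t split_addr(uint32_t addr, unsigned index_bits, unsigned offset_bits) {
    addr_parts_t p;
    p.offset = addr & ((1u << offset_bits) - 1);
    p.index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    p.tag    = addr >> (offset_bits + index_bits);
    return p;
}
```

Addresses 0, 4, 8, ... all yield index 0 but different tags, which is exactly how the cache tells them apart.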
Example: 1 KB Direct Mapped Cache
For a 2**N byte cache:
- The uppermost (32 - N) bits are the Cache Tag
- The lowest M bits are the Byte Select (block size = 2**M)
[Figure: 32-bit address (bit 31 down to 0) split into Cache Tag (example: 0x50), Cache Index, and Byte Select; the tag 0x50 and a Valid Bit are stored as part of the cache state; Cache Tag + Cache Index form the block address]
Set Associative Cache
N-way set associative: N entries for each Cache Index
- N direct mapped caches operate in parallel
Example: two-way set associative cache
- Cache Index selects a set from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
[Figure: per-way Valid, Cache Tag, and Cache Data arrays, both ways indexed by the same Cache Index]
Disadvantage of Set Associative Caches
N-way set associative cache vs. direct mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the Hit/Miss decision
In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
- Possible to assume a hit and continue; recover later if it was a miss
[Figure: the organization of the data cache in the Alpha AXP 21064 microprocessor. The CPU address (block address + block offset) is split into a Tag, an <8>-bit Index, and a <5>-bit offset; the index selects one of 256 blocks, whose Valid bit and Tag are compared (=?) against the address tag; a 4:1 mux selects the data out; a write buffer sits between the cache and lower-level memory.]
Miss-oriented Approach to Memory Access
CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime
- CPI_Execution includes ALU and Memory instructions

Cache Performance: Separating out the Memory Component
- AMAT = Average Memory Access Time
- CPI_AluOps does not include memory instructions
CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
AMAT = HitTime + MissRate x MissPenalty
     = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data x MissPenalty_Data)
Impact on Performance
Suppose a processor executes at:
- Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50 cycle miss penalty
Suppose that 1% of instructions get the same miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
      + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = 1.1 + 1.5 + 0.5 = 3.1
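The stall arithmetic above (ideal CPI 1.1, 30% data references with a 10% miss rate, 1% instruction misses, 50-cycle penalty) follows one formula per access class. A small helper, with names of my own choosing:

```c
#include <assert.h>

/* Stall cycles per instruction contributed by one class of accesses:
 * accesses/instruction x miss rate x miss penalty (cycles). */
double stall_cpi(double acc_per_inst, double miss_rate, double penalty) {
    return acc_per_inst * miss_rate * penalty;
}
```

Summing the data stalls (1.5), instruction stalls (0.5), and ideal CPI (1.1) reproduces the CPI of 3.1.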
Unified vs. Split Caches
Unified vs. Separate I&D
Example: 16KB I&D: Inst miss rate = 0.64%, Data miss rate = ...
[Figure: split I-cache-1 and D-cache-1 over a unified Cache-2, vs. a unified Cache-1 over a unified Cache-2]
How to Improve Cache Performance?
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
AMAT = Hit Time + Miss Rate x Miss Penalty
Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reducing misses via victim cache
4. Reducing misses via pseudo-associativity
5. Reducing misses by HW prefetching
6. Reducing misses by SW prefetching
7. Reducing misses by compiler optimizations

Danger of concentrating on just one parameter.
Prefetching comes in two flavors:
- Binding prefetch: requests load directly into register; must be correct address and register
- Non-binding prefetch: load into cache; can be incorrect; frees HW to guess

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
Where do misses come from?
Classifying Misses: 3 Cs
- Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses will occur because a block can be discarded and later retrieved if too many blocks map to its set.

3Cs Absolute Miss Rate per type
[Chart: miss rate (up to about 0.14) vs. cache size for 1-way, 2-way, 4-way, 8-way associativity]
0. Cache Size
Old rule of thumb: 2x size => 25% cut in miss rate
What does it reduce? Thrashing reduction!!!
[Chart: miss rate (0-0.14) vs. cache size (1-128 KB) for 1-way, 2-way, 4-way, 8-way associativity, with Capacity and Compulsory components]
Cache Organization?
Assume total cache size is not changed.
What happens if we:
1) Change Block Size
2) Change Associativity
3) Change Compiler
Which of the 3Cs is obviously affected?
1. Larger Block Size (fixed size & assoc)
[Chart: miss rate (5%-25%) vs. block size for 1K, 4K, 16K, 64K caches]

2. Higher Associativity
[Chart: miss rate (0.04-0.14) vs. cache size for 1-way, 2-way, 4-way, 8-way associativity]
3Cs Relative Miss Rate
[Chart: relative miss rate (0%-100%) vs. cache size (1-128 KB), stacked into Conflict (1-way through 8-way), Capacity, and Compulsory components]
Flaws: for fixed block size
Good: insight => invention
Associativity vs. Cycle Time
Beware: execution time is the only final measure!
Why is cycle time tied to hit time?
Will clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
- suggested big and dumb caches
Effective cycle time of associativity (Przybylski, ISCA)

Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
3. Victim Cache
Fast Hit Time + Low Conflict => Victim Cache
How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a buffer to place data discarded from the cache
4. Pseudo-Associativity
How to combine the fast hit time of Direct Mapped and have the lower conflict misses of a 2-way SA cache?
Divide cache: on a miss, check the other half of the cache to see if it is there; if so, have a pseudo-hit (slow hit)
Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
- Better for caches not tied directly to processor (L2)
- Used in MIPS R10000 L2 cache, similar in UltraSPARC
[Figure: access time line showing Hit Time, Pseudo Hit Time, and Miss Penalty]
5. Hardware Prefetching of Instructions & Data
E.g., instruction prefetching:
- Alpha 21064 fetches 2 blocks on a miss
- Extra block placed in stream buffer
- On miss, check stream buffer
Works with data blocks too:
- Jouppi [1990]: 1 data stream buffer on a 4KB cache; 4 streams got 43%
- Palacharla & Kessler [1994]: streams got 50% to 70% of misses from two 64KB, 4-way set associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Software Prefetching Data
Data prefetch:
- Load data into register (HP PA-RISC loads)
- Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
- Special prefetching instructions cannot cause faults; a form of speculative execution
Prefetching comes in two flavors:
- Binding prefetch: requests load directly into register; must be correct address and register!
- Non-binding prefetch: load into cache; can be incorrect
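The MIPS IV / PowerPC / SPARC v9 instructions above are hardware forms of non-binding cache prefetch. In portable C the same idea can be sketched with the GCC/Clang `__builtin_prefetch` builtin (a hint that may be dropped and cannot fault); the loop and lookahead distance here are my own illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array, hinting the cache to fetch a few iterations ahead.
 * The lookahead of 8 elements is an arbitrary example value. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 1);  /* read, low temporal locality */
#endif
        s += a[i];
    }
    return s;
}
```

The prefetch changes only timing, never the result, which is exactly the "cannot cause faults / can be incorrect" property of non-binding prefetch.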
7. Compiler Optimizations
McFarling [1989] reduced cache misses on an 8KB direct mapped cache with 4 byte blocks, in software
Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)
Data:
- Merging arrays: improve spatial locality
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
Blocking Example
/* After */
for (jj = 0; jj< N; jj = jj+B)
for (kk = 0; kk< N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j< min(jj+B-1,N); j = j+1)
{r = 0;
for (k = kk; k< min(kk+B-1,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
B is called the Blocking Factor
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
Conflict Misses Too?
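The capacity-miss counts above can be checked numerically. A sketch of the arithmetic (function names are mine, not slide code):

```c
#include <assert.h>

/* Estimated words touched (capacity misses in the worst case) for the
 * N x N matrix multiply: without blocking, and with blocking factor B. */
long misses_unblocked(long n)       { return 2*n*n*n + n*n; }
long misses_blocked(long n, long b) { return 2*n*n*n/b + n*n; }
```

For N = 10 and B = 5 the cubic term shrinks from 2000 to 400, showing how the blocking factor divides the dominant term while the N^2 term is unchanged.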
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size
- Lam et al [1991]: a blocking factor of ... vs. 48, despite both fitting in cache
[Chart: miss rate (0-0.1) vs. blocking factor (0-50) for cholesky, spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), vpenta (nasa7)]

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Miss Penalty
Four techniques:
- Read priority over write on miss
- Early Restart and Critical Word First on miss
- Non-blocking caches (Hit under Miss, Miss under Miss)
- Second-level cache
Can be applied recursively to multilevel caches
- Danger is that time to DRAM will grow with multiple levels in between
- First attempts at L2 caches can make things worse, since the increased worst case is worse
Out-of-order CPU can hide an L1 data cache miss (3-5 clocks), but stalls on an L2 miss (40-100 clocks)

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
1. Read Priority over Write on Miss
Write-through w/ write buffers => RAW conflicts with main memory reads on cache misses
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000 by 50%)
- Check write buffer contents before a read; if no conflicts, let the memory access continue
Write-back: want the buffer to hold displaced blocks
2. Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the block. Also called wrapped fetch and requested word first
3. Non-blocking Caches
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
- requires F/E bits on registers or out-of-order execution
- requires multi-bank memories
"hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
"hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise cannot support it)
- Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC
- FP programs on average: AMAT = 0.6...
- Int programs on average: AMAT = 0.2...
- 8 KB Data Cache, Direct Mapped, 32-byte blocks
[Chart: "Hit Under i Misses"; AMAT ratio (0-2.0) for integer benchmarks (eqntott, espresso, xlisp, compress, mdljsp2, ear) and floating-point benchmarks (fpppp, tomcatv, swm256, doduc, su2cor, wave5)]
4. Add a Second-level Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Comparing Local and Global Miss Rates
- 32 KByte 1st level cache; increasing 2nd level cache
- Global miss rate close to single-level cache rate provided L2 >> L1
- Don't use the local miss rate
- L2 not tied to CPU clock
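The two-level AMAT equation above composes directly. A small helper with illustrative parameter values of my own (the slides give the formula, not these numbers):

```c
#include <assert.h>

/* AMAT for a two-level hierarchy:
 * AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2)
 * where mr2_local is the LOCAL L2 miss rate (misses in L2 per access to L2). */
double amat2(double hit1, double mr1,
             double hit2, double mr2_local, double pen2) {
    return hit1 + mr1 * (hit2 + mr2_local * pen2);
}
```

For example, with a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and 100-cycle L2 penalty, AMAT = 1 + 0.05 x (10 + 50) = 4 cycles; note the local L2 rate looks alarming while the global L2 miss rate is only 0.05 x 0.5 = 2.5%, which is why the slide warns against using the local rate alone.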
Reducing Misses: Which Apply to L2 Cache?
Reducing Miss Rate:
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
Relative CPU Time vs. L2 Block Size
[Chart: relative CPU time (1.0-1.9) vs. L2 cache block size; plotted values include 1.36, 1.28, and 1.27 for 16-, 32-, and 64-byte blocks; 32KB L1, 8-byte path to memory]
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Small and Simple Caches
Why does the Alpha 21164 have an 8KB instruction cache and 8KB data cache plus a 96KB second-level cache?
- Small data cache and clock rate
- Direct mapped, on chip
2. Avoiding Address Translation
Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. Physical Cache
- Every time a process is switched, we logically must flush the cache; otherwise we get false hits
  - Cost is time to flush + compulsory misses from the empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O must interact with the cache, so it needs virtual addresses
Solution to aliases:
- HW guarantees every cache block has a unique physical address
- SW guarantee: lower n bits must have the same address; as long as they cover the index field & direct mapped, they must be unique; called page coloring
Solution to cache flush:
- Add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if wrong process
Virtually Addressed Caches
[Figure: conventional organization: CPU sends VA to the TB (translation buffer), the resulting PA goes to the cache and MEM; virtually addressed organization: the cache is accessed with the VA and translation happens only on a miss; synonym problem illustrated with VA tags]

3. Pipelined Writes
- Pipeline Tag Check and Update Cache as separate stages; the current write does its tag check while the previous write updates the cache
- Only STORES are in the pipeline; it is empty during a miss
- Store r2, (r1): check r1 ...

Case Study: MIPS R4000
8 Stage Pipeline:
- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access
- IS: second half of access to instruction cache
- RF: instruction decode and register fetch, also instruction cache hit detection
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation
Case Study: MIPS R4000
[Figure: pipeline stage diagrams (IF IS RF EX DF DS TC WB) showing a TWO Cycle Load Latency and a THREE Cycle Branch Latency (conditions evaluated during EX phase)]
Delay slot plus two stalls
Branch likely cancels delay slot if not taken
R4000 Performance
Not ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled slots)
- FP result stalls: RAW data hazards (latency)
- FP structural stalls: not enough FP hardware (parallelism)
[Chart: CPI (0-4.5) for eqntott, espresso, gcc, li, doduc, nasa7, broken into Base, Load stalls, and Branch stalls components]
What is the Impact of What You've Learned About Caches?
1960-1985: Speed = f(no. operations)
1990s: pipelined execution & fast clock rate
[Chart: CPU vs. DRAM performance (100-1000, log scale); Processor-Memory Performance Gap grows 50% / year]
Cache Optimization Summary
[Table: techniques and their effect on miss rate: Larger Block Size, Higher Associativity, Victim Caches, Pseudo-Associative Caches, HW Prefetching of Instr/Data, Compiler Controlled Prefetching, Compiler Reduce Misses]
Cache Cross-Cutting Issues
- Superscalar CPU & number of cache ports must match: number of memory accesses/cycle
- Speculative execution and non-faulting option on memory/TLB
- Parallel execution vs. cache locality: want far separation to find independent operations vs. want reuse of data accesses to avoid misses
- I/O and consistency of data between cache and memory: caches => multiple copies of data
  - Consistency by HW or by SW?
  - Where to connect I/O to the computer?
Alpha Memory Performance: Miss Rates of SPEC
[Chart: miss rate (0.01%-100%, log scale) for AlphaSort, Espresso, Sc, Mdljsp2, Ear; annotations: I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; and I$ miss = 6%, D$ miss = 32%, L2 miss = 10%]
Alpha CPI Components
[Chart: CPI components (up to 5.00): instruction stall: branch mispredict (green); Data cache (blue); Instruction cache (yellow); L2$ (pink); Other: compute + reg conflicts, structural conflicts]
Predicting Cache Performance from Different Programs (ISA, compiler, ...)
- 4KB data cache: miss rate 8%, 12%, or 28%?
- 1KB instr cache: miss rate 0%, 3%, or 10%?
[Chart: miss rate (20%-35%) for the D$ on different programs, e.g. gcc]
Main Memory Background
Performance of main memory:
- Latency: cache miss penalty
  - Access Time: time between the request and the word arriving
  - Cycle Time: time between requests
- Bandwidth: I/O & large block miss penalty (L2)
Main memory is DRAM: Dynamic Random Access Memory
- Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
- Addresses divided into 2 halves (memory as a 2D matrix):
  - RAS or Row Access Strobe
  - CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
- No refresh (6 transistors/bit vs. 1 transistor/bit, area is 10X)
- Address not divided: full address
Size: DRAM/SRAM 4-8; Cost & cycle time: SRAM/DRAM 8-16
Main Memory Deep Background
- "Out-of-Core", "In-Core", ...?
- "Core memory"?
- Non-volatile, magnetic
- Lost to 4 Kbit DRAM (today: DRAM)
- Access time 750 ns, cycle time ...
DRAM Logical Organization (4 Mbit)
[Figure: 11 address lines feed a row decoder into the memory array; Column Decoder and Sense Amps & I/O produce the data D]
DRAM Physical Organization (4 Mbit)
[Figure: multiple blocks, each with its own Block Row Decoder; Row and Column Address route through I/O blocks]
4 Key DRAM Timing Parameters
- tRAC: minimum time from RAS line falling to valid data output.
  - Quoted as the speed of a DRAM when you buy it
  - A typical 4Mb DRAM has tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next.
  - tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS line falling to valid data output.
  - 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next.
  - 35 ns for a 4Mbit DRAM with a tRAC of 60 ns
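These cycle times translate directly into peak bandwidth. A sketch of the conversion (the per-access width is my assumption; the slide does not state the part's output width):

```c
#include <assert.h>

/* Peak bandwidth in MB/s for one access of `bytes_per_access` bytes
 * every `cycle_ns` nanoseconds:
 *   bytes * (1e9 ns/s) / cycle_ns / (1e6 B/MB) = bytes * 1000 / cycle_ns */
double peak_mb_per_s(double cycle_ns, double bytes_per_access) {
    return bytes_per_access * 1000.0 / cycle_ns;
}
```

Assuming a 4-byte transfer per access, back-to-back row accesses (tRC = 110 ns) sustain roughly 36 MB/s, while same-row column accesses (tPC = 35 ns) sustain about 114 MB/s, which is why page-mode operation matters.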
DRAM Performance
A 60 ns (tRAC) DRAM can
- perform a row access only every 110 ns (tRC)
- perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
- In practice, external address delays and turning around buses make it 40 to 50 ns
These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
DRAM History
- DRAMs: capacity +60%/yr, cost -30%/yr
- 2.5X cells/area, 1.5X die size in 3 years
- '98 DRAM fab line costs $2B
- DRAM only: density, leakage vs. speed
- Rely on increasing no. of computers & memory per computer (60% market)
- SIMM or DIMM is the replaceable unit
DRAM Future: 1 Gbit DRAM
- Mitsubishi: Blocks 512 x 2 Mb, Clock 200 MHz, Data Pins 64
Main Memory Performance
- Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits)
- Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
- Interleaved: CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
Main Memory Performance
Timing model (word size is 32 bits):
- 1 cycle to send the address, 6 cycles access time, 1 cycle to send data
- Cache block is 4 words
- Simple M.P.      = 4 x (1 + 6 + 1) = 32
- Wide M.P.        = 1 + 6 + 1 = 8
- Interleaved M.P. = 1 + 6 + 4 x 1 = 11
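The three miss penalties above follow from the same three timing terms. A sketch of the arithmetic (function names are my own):

```c
#include <assert.h>

/* Miss penalty in bus cycles for a `words`-word cache block, given
 * `addr` cycles to send the address, `access` cycles of memory access
 * time, and `xfer` cycles to return one word. */
int simple_penalty(int addr, int access, int xfer, int words) {
    return words * (addr + access + xfer);   /* one word at a time */
}
int wide_penalty(int addr, int access, int xfer) {
    return addr + access + xfer;             /* whole block at once */
}
int interleaved_penalty(int addr, int access, int xfer, int words) {
    return addr + access + words * xfer;     /* accesses overlap in banks */
}
```

With the slide's parameters (1, 6, 1, 4 words) these give 32, 8, and 11 cycles: interleaving overlaps the 6-cycle access times across banks and pays only the serialized word transfers.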
Independent Memory Banks
Memory banks for independent accesses vs. faster sequential accesses
- Multiprocessor
- I/O
- CPU with Hit under n Misses, Non-blocking Cache
Superbank: all memory active on one block transfer (or bank)
Independent Memory Banks
How many banks?
number of banks >= number of clocks to access a word in a bank
- For sequential accesses; otherwise we will return to the original bank before it has the next word ready
- (like in the vector case)
Main Memory Summary
Wider Memory
Interleaved Memory: for sequential orindependent accesses
Avoiding bank conflicts: SW & HW
DRAM specific optimizations: page mode &Specialty DRAM
DRAM future less rosy?
DRAM Cross-Cutting Issues
- After 20 years of 4X every 3 years, running into a wall? (64Mb - 1 Gb)
- How can we keep $1B fab lines full if we buy fewer DRAMs per computer?
- Cost/bit -30%/yr if we stop 4X?
- What will happen to the $40B/yr DRAM industry?
DRAMs per PC over Time
[Table: minimum memory size vs. DRAM generation ('86: 1 Mb, '89: 4 Mb, '92: 16 Mb, '96: 64 Mb, '99: 256 Mb, '02: 1 Gb); e.g. rows for 4 MB and 8 MB systems showing chips per PC (32, 16, 8, 4, 2)]
Virtual Memory
A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space.
All translations to primary and secondary addresses are handled transparently.
Basic Issues in VM System Design
- Size of information blocks that are transferred from secondary to main storage (M)
- If a block of information is brought into M, and M is full, then some region of M must be released to make room for the new block --> replacement policy
- Which region of M is to hold the new block --> placement policy
- Missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy

Paging Organization
Virtual and physical address spaces are partitioned into blocks of equal size:
- physical memory: page frames
- virtual memory: pages
[Figure: reg -> cache -> mem -> disk hierarchy, with frames in memory and pages on disk]
Addressing and Accessing a Two-Level Hierarchy
The computer system, HW or SW, must perform any address translation that is required.
Two ways of forming the address: segmentation and paging. Paging is more common. Sometimes the two are used together, one on top of the other. More about address translation later.
[Figure: system address -> memory management unit (MMU) with translation function (mapping tables, permissions, etc.) -> hit or miss]
Paging vs. Segmentation
[Figure: paging: the system address (block, word) indexes a lookup table and the block field is replaced directly; segmentation: the system address is looked up in a table to get a base address, which is added to the word offset]
Paging Organization
[Figure: address mapping with 1K pages; virtual addresses 0, 1024, ..., 31744 (pages) map through the Addr Trans MAP to physical addresses 0, 1024, ..., 7168 (frames 0-7 of Physical Memory); the V.A. splits into a page number and a 10-bit displacement to form the P.A.]
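The mapping above (1K pages, so a 10-bit displacement that is carried through unchanged) can be sketched as a lookup. The tiny page table is a hypothetical example of mine; a real table would also carry valid and protection bits:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 10u   /* 1K-byte pages, as on this slide */

/* Translate a virtual address through a flat page table that maps
 * virtual page numbers to physical frame numbers. */
uint32_t translate(const uint32_t *page_table, uint32_t va) {
    uint32_t vpn  = va >> PAGE_BITS;              /* virtual page number */
    uint32_t disp = va & ((1u << PAGE_BITS) - 1); /* displacement, unchanged */
    return (page_table[vpn] << PAGE_BITS) | disp;
}
```

Mapping virtual page 0 to frame 7 yields physical address 7168, matching the highest frame address in the figure.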
Segmentation Organization
- Notice that each segment's virtual address starts at 0, different from its physical address.
- Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.
- Compaction routines must occasionally be run to remove these fragments.
[Figure: virtual memory addresses (segments 1, 5, 6, 9, 3, each starting at 0) mapped into main memory at physical addresses 0000 onward, with gaps between segments]
Translation Lookaside Buffer
A way to speed up translation is to use a special cache of recently used page table entries; this has many names, but the most frequently used is Translation Lookaside Buffer or TLB
[Table: TLB entry: Virtual Address, Physical Address, Dirty, ...]
Really just a cache on the page table mappings
TLB access time is comparable to cache access time (much less than main memory access time)
Translation Lookaside Buffers
[Figure: VA -> TLB lookup -> PA on a hit; a miss falls back to translation]
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than 128-256 entries even on high end machines. This permits fully associative lookup on these machines.
Most mid-range machines use small n-way set associative organizations.
Address Translation
Page table is a large data structure ...
[Figure: CPU -> translation cache (VA -> PA); on a hit, data is returned directly]
Overlapped Cache & TLB Access
[Figure: 32-bit address split into a 20-bit page # and a 12-bit disp; the TLB does an assoc lookup on the page # in parallel with the cache access, which uses a 10-bit index into 1K sets of 4-byte entries; the TLB's PA is compared (=) against the cache tag for Hit/Miss]

IF cache hit AND (cache tag = PA) then deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] and TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation
Memory Hierarchy Summary
The memory hierarchy, from fast and expensive to slow and cheap: registers ... disk
- At first, consider just two adjacent levels in the hierarchy
The cache: high speed and expensive
- Direct mapped, associative, set associative
Virtual memory makes the hierarchy look big
- Translate the address from the CPU's logical address to the physical address where the information is actually stored
- The TLB helps in speeding the address translation
Memory management: how to move information back and forth
Practical Memory Hierarchy
- Issue is NOT inventing new mechanisms
- Issue is taste in selecting between many alternatives in putting together a memory hierarchy that fits well together
  - e.g., L1 data cache write-through, L2 write-back
TLB and Virtual Memory
Caches, TLBs, and virtual memory may all be understood by examining how they deal with four questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
- Page tables map virtual addresses to physical addresses
- TLBs make virtual memory practical
- Locality in data => locality in addresses of data, temporal and spatial
Alpha 21064
- Separate Instr & Data TLBs & Caches
- TLBs fully associative
- TLB updates in SW ("Priv Arch Libr")
- Caches 8KB direct mapped, write through
- Critical 8 bytes first
- Prefetch instr. stream buffer
- 2 MB L2 cache, direct mapped, WB (off-chip)
- 256 bit path to main memory, 4 x 64-bit modules
- Victim Buffer: to give read priority over write
- 4 entry write buffer between D$ & L2$
[Figure: instruction and data paths with Stream Buffer, Write Buffer, and Victim Buffer]