7/29/2019 Computer Architecture Cache Design
COEN6741 Chap 5.1
11/10/2003
(Dr. Sofine Tahar)
COEN6741
Computer Architecture and Design
Chapter 5
Memory Hierarchy Design
Outline
- Introduction
- Memory Hierarchy
- Cache Memory
- Cache Performance
- Main Memory
- Virtual Memory
- Translation Lookaside Buffer
- Alpha 21064 Example
Computer Architecture Topics
[Figure: memory hierarchy and storage topics - L2 cache, DRAM, disks/WORM/tape; coherence, bandwidth; emerging technologies; interleaving, bus protocols; RAID; input/output and storage]
Who Cares About the Memory Hierarchy?
[Chart: processor vs. DRAM performance over time (log scale, 100-1000) - the processor-DRAM memory gap]
Levels of the Memory Hierarchy
CPU registers: 100s of bytes
A Modern Memory Hierarchy
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Requires servicing faults on the processor (Control + Datapath).

Level                                Speed (ns)                  Size (bytes)
Registers                            1s                          100s
On-chip cache / 2nd-level cache (SRAM)  10s                      Ks
Main memory (DRAM)                   100s                        Ms
Secondary storage (disk)             10,000,000s (10s ms)        Gs
Tertiary storage (disk/tape)         10,000,000,000s (10s sec)   Ts
The Memory Abstraction
Association of address and data
Q3: Which block should be replaced on a miss?
- Easy for direct mapped
- Set associative or fully associative: Random, LRU (Least Recently Used)

Miss rates vs. associativity and replacement policy:
Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB    5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB    1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB   1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
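LRU bookkeeping for one set can be sketched as an MRU-ordered stack. This is a hypothetical illustration (names and structure are mine, not from the slides), assuming a 4-way set with integer tags:

```c
#include <assert.h>

#define WAYS 4

/* One N-way set: tags[0] is the most recently used way,
 * tags[WAYS-1] the least recently used. */
typedef struct {
    int tags[WAYS];   /* -1 marks an empty (invalid) way */
} set_t;

void set_init(set_t *s) {
    for (int i = 0; i < WAYS; i++) s->tags[i] = -1;
}

/* Access `tag`; returns 1 on hit, 0 on miss.  On a miss the LRU way
 * (last slot) is the victim.  Either way, `tag` moves to the front. */
int set_access(set_t *s, int tag) {
    int hit = 0, pos = WAYS - 1;          /* default victim: LRU way */
    for (int i = 0; i < WAYS; i++) {
        if (s->tags[i] == tag) { hit = 1; pos = i; break; }
    }
    for (int i = pos; i > 0; i--)         /* shift MRU stack down */
        s->tags[i] = s->tags[i - 1];
    s->tags[0] = tag;                     /* tag is now MRU */
    return hit;
}
```

Random replacement would simply pick a victim way at random instead of maintaining the stack, which is why its hardware cost is lower and, per the table, its miss rate only slightly higher.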
Q4: What happens on a write?
Write-through: all writes update cache and underlying memory/cache
- Can always discard cached data; the most up-to-date copy is in memory
- Cache control bit: only a valid bit
Write-back: all writes simply update the cache
- Can't just discard cached data; may have to write it back to memory
- Cache control bits: both valid and dirty bits
Other advantages:
- Write-through: memory (or other processors) always have the latest data; simpler management of cache
- Write-back: much lower bandwidth, since data are often overwritten multiple times; better tolerance to long-latency memory

Write Policy: (What happens on a write miss?)
- Write allocate: allocate a new cache line in the cache. Usually means that you have to do a read miss to fill in the rest of the cache line!
Write Buffer for Write-Through
A Write Buffer is needed between the Cache and Memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory
[Figure: Processor and Cache feed a Write Buffer, which drains to DRAM]
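The bandwidth difference between the two policies above can be shown with a toy model. This is a sketch of my own (one cached block, write hits only tracked), not code from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Write-through: every store is sent to memory. */
long write_through(const int *blocks, size_t n) {
    (void)blocks;
    return (long)n;
}

/* Write-back: a store marks the cached block dirty; memory is
 * written only when a different block evicts a dirty one. */
long write_back(const int *blocks, size_t n) {
    int cur = -1, dirty = 0;   /* -1 = invalid */
    long mem_writes = 0;
    for (size_t i = 0; i < n; i++) {
        if (cur != blocks[i]) {            /* miss: evict old block */
            if (dirty) mem_writes++;       /* flush dirty data */
            cur = blocks[i];
        }
        dirty = 1;                         /* write hits stay local */
    }
    if (dirty) mem_writes++;               /* final flush */
    return mem_writes;
}
```

Repeated writes to the same block cost one memory write under write-back but one per store under write-through, which is the "much lower bandwidth" advantage named above.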
Simplest Cache: Direct Mapped
[Figure: 4-byte direct-mapped cache; memory addresses 0-F map to cache indexes 0-3]
- Location 0 can be occupied by data from: memory location 0, 4, 8, ... etc.
- In general: any memory location whose 2 LSBs of the address are 0s
- Address => cache index
- Which one should we place in the cache?
- How can we tell which one is in the cache?
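The address split implied above (low bits select the index, remaining bits form the tag that answers "which one is in the cache") can be sketched generically. A minimal illustration of mine, parameterized by index and offset widths:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t tag, index, offset; } addr_parts_t;

/* Split an address into tag / index / byte offset for a direct-mapped
 * cache with 2^index_bits entries and 2^offset_bits-byte blocks.
 * The slide's 4-entry, 1-byte-block cache is index_bits=2, offset_bits=0. */
addr_parts_t split_addr(uint32_t addr, unsigned index_bits, unsigned offset_bits) {
    addr_parts_t p;
    p.offset = addr & ((1u << offset_bits) - 1);
    p.index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    p.tag    = addr >> (offset_bits + index_bits);
    return p;
}
```

Addresses 0, 4, 8, ... all yield index 0 but different tags, which is exactly how the cache tells them apart.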
Example: 1 KB Direct Mapped Cache
For a 2**N byte cache:
- The uppermost (32 - N) bits are the Cache Tag
- The lowest M bits are the Byte Select (block size = 2**M)
[Figure: 32-bit address (bit 31 down to 0) split into Cache Tag (example: 0x50), Cache Index, and Byte Select; the tag 0x50 and a Valid Bit are stored as part of the cache state; Cache Tag + Cache Index form the block address]
Set Associative Cache
N-way set associative: N entries for each Cache Index
- N direct mapped caches operate in parallel
Example: two-way set associative cache
- Cache Index selects a set from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
[Figure: per-way Valid, Cache Tag, and Cache Data arrays, both ways indexed by the same Cache Index]
Disadvantage of Set Associative Caches
N-way set associative cache vs. direct mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the Hit/Miss decision
In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
- Possible to assume a hit and continue; recover later if it was a miss
[Figure: the organization of the data cache in the Alpha AXP 21064 microprocessor. The CPU address (block address + block offset) is split into a Tag, an <8>-bit Index, and a <5>-bit offset; the index selects one of 256 blocks, whose Valid bit and Tag are compared (=?) against the address tag; a 4:1 mux selects the data out; a write buffer sits between the cache and lower-level memory.]
Miss-oriented Approach to Memory Access
CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime
- CPI_Execution includes ALU and Memory instructions

Cache Performance: Separating out the Memory Component
- AMAT = Average Memory Access Time
- CPI_AluOps does not include memory instructions
CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
AMAT = HitTime + MissRate x MissPenalty
     = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data x MissPenalty_Data)
Impact on Performance
Suppose a processor executes at:
- Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50 cycle miss penalty
Suppose that 1% of instructions get the same miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
      + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = 1.1 + 1.5 + 0.5 = 3.1
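The stall arithmetic above (ideal CPI 1.1, 30% data references with a 10% miss rate, 1% instruction misses, 50-cycle penalty) follows one formula per access class. A small helper, with names of my own choosing:

```c
#include <assert.h>

/* Stall cycles per instruction contributed by one class of accesses:
 * accesses/instruction x miss rate x miss penalty (cycles). */
double stall_cpi(double acc_per_inst, double miss_rate, double penalty) {
    return acc_per_inst * miss_rate * penalty;
}
```

Summing the data stalls (1.5), instruction stalls (0.5), and ideal CPI (1.1) reproduces the CPI of 3.1.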
Unified vs. Split Caches
Unified vs. Separate I&D
Example: 16KB I&D: Inst miss rate = 0.64%, Data miss rate = ...
[Figure: split I-cache-1 and D-cache-1 over a unified Cache-2, vs. a unified Cache-1 over a unified Cache-2]
How to Improve Cache Performance?
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
AMAT = Hit Time + Miss Rate x Miss Penalty
Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reducing misses via victim cache
4. Reducing misses via pseudo-associativity
5. Reducing misses by HW prefetching
6. Reducing misses by SW prefetching
7. Reducing misses by compiler optimizations

Danger of concentrating on just one parameter.
Prefetching comes in two flavors:
- Binding prefetch: requests load directly into register; must be correct address and register
- Non-binding prefetch: load into cache; can be incorrect; frees HW to guess

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
Where do misses come from?
Classifying Misses: 3 Cs
- Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses will occur because a block can be discarded and later retrieved if too many blocks map to its set.

3Cs Absolute Miss Rate per type
[Chart: miss rate (up to about 0.14) vs. cache size for 1-way, 2-way, 4-way, 8-way associativity]
0. Cache Size
Old rule of thumb: 2x size => 25% cut in miss rate
What does it reduce? Thrashing reduction!!!
[Chart: miss rate (0-0.14) vs. cache size (1-128 KB) for 1-way, 2-way, 4-way, 8-way associativity, with Capacity and Compulsory components]
Cache Organization?
Assume total cache size is not changed.
What happens if we:
1) Change Block Size
2) Change Associativity
3) Change Compiler
Which of the 3Cs is obviously affected?
1. Larger Block Size (fixed size & assoc)
[Chart: miss rate (5%-25%) vs. block size for 1K, 4K, 16K, 64K caches]

2. Higher Associativity
[Chart: miss rate (0.04-0.14) vs. cache size for 1-way, 2-way, 4-way, 8-way associativity]
3Cs Relative Miss Rate
[Chart: relative miss rate (0%-100%) vs. cache size (1-128 KB), stacked into Conflict (1-way through 8-way), Capacity, and Compulsory components]
Flaws: for fixed block size
Good: insight => invention
Associativity vs. Cycle Time
Beware: execution time is the only final measure!
Why is cycle time tied to hit time?
Will clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
- suggested big and dumb caches
Effective cycle time of associativity (Przybylski, ISCA)

Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
3. Victim Cache
Fast Hit Time + Low Conflict => Victim Cache
How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a buffer to place data discarded from the cache
4. Pseudo-Associativity
How to combine the fast hit time of Direct Mapped and have the lower conflict misses of a 2-way SA cache?
Divide cache: on a miss, check the other half of the cache to see if it is there; if so, have a pseudo-hit (slow hit)
Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
- Better for caches not tied directly to processor (L2)
- Used in MIPS R10000 L2 cache, similar in UltraSPARC
[Figure: access time line showing Hit Time, Pseudo Hit Time, and Miss Penalty]
5. Hardware Prefetching of Instructions & Data
E.g., instruction prefetching:
- Alpha 21064 fetches 2 blocks on a miss
- Extra block placed in stream buffer
- On miss, check stream buffer
Works with data blocks too:
- Jouppi [1990]: 1 data stream buffer on a 4KB cache; 4 streams got 43%
- Palacharla & Kessler [1994]: streams got 50% to 70% of misses from two 64KB, 4-way set associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Software Prefetching Data
Data prefetch:
- Load data into register (HP PA-RISC loads)
- Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
- Special prefetching instructions cannot cause faults; a form of speculative execution
Prefetching comes in two flavors:
- Binding prefetch: requests load directly into register; must be correct address and register!
- Non-binding prefetch: load into cache; can be incorrect
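The MIPS IV / PowerPC / SPARC v9 instructions above are hardware forms of non-binding cache prefetch. In portable C the same idea can be sketched with the GCC/Clang `__builtin_prefetch` builtin (a hint that may be dropped and cannot fault); the loop and lookahead distance here are my own illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array, hinting the cache to fetch a few iterations ahead.
 * The lookahead of 8 elements is an arbitrary example value. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 1);  /* read, low temporal locality */
#endif
        s += a[i];
    }
    return s;
}
```

The prefetch changes only timing, never the result, which is exactly the "cannot cause faults / can be incorrect" property of non-binding prefetch.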
7. Compiler Optimizations
McFarling [1989] reduced cache misses on an 8KB direct mapped cache with 4 byte blocks, in software
Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)
Data:
- Merging arrays: improve spatial locality
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
Blocking Example
/* After */
for (jj = 0; jj< N; jj = jj+B)
for (kk = 0; kk< N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j< min(jj+B-1,N); j = j+1)
{r = 0;
for (k = kk; k< min(kk+B-1,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
B is called the Blocking Factor
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
Conflict Misses Too?
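The capacity-miss counts above can be checked numerically. A sketch of the arithmetic (function names are mine, not slide code):

```c
#include <assert.h>

/* Estimated words touched (capacity misses in the worst case) for the
 * N x N matrix multiply: without blocking, and with blocking factor B. */
long misses_unblocked(long n)       { return 2*n*n*n + n*n; }
long misses_blocked(long n, long b) { return 2*n*n*n/b + n*n; }
```

For N = 10 and B = 5 the cubic term shrinks from 2000 to 400, showing how the blocking factor divides the dominant term while the N^2 term is unchanged.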
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size
- Lam et al [1991]: a blocking factor of ... vs. 48, despite both fitting in cache
[Chart: miss rate (0-0.1) vs. blocking factor (0-50) for cholesky, spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), vpenta (nasa7)]

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Miss Penalty
Four techniques:
- Read priority over write on miss
- Early Restart and Critical Word First on miss
- Non-blocking caches (Hit under Miss, Miss under Miss)
- Second-level cache
Can be applied recursively to multilevel caches
- Danger is that time to DRAM will grow with multiple levels in between
- First attempts at L2 caches can make things worse, since the increased worst case is worse
Out-of-order CPU can hide an L1 data cache miss (3-5 clocks), but stalls on an L2 miss (40-100 clocks)

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
1. Read Priority over Write on Miss
Write-through w/ write buffers => RAW conflicts with main memory reads on cache misses
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000 by 50%)
- Check write buffer contents before a read; if no conflicts, let the memory access continue
Write-back: want the buffer to hold displaced blocks
2. Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the block. Also called wrapped fetch and requested word first
3. Non-blocking Caches
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
- requires F/E bits on registers or out-of-order execution
- requires multi-bank memories
"hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
"hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise cannot support it)
- Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC
- FP programs on average: AMAT = 0.6...
- Int programs on average: AMAT = 0.2...
- 8 KB Data Cache, Direct Mapped, 32-byte blocks
[Chart: "Hit Under i Misses"; AMAT ratio (0-2.0) for integer benchmarks (eqntott, espresso, xlisp, compress, mdljsp2, ear) and floating-point benchmarks (fpppp, tomcatv, swm256, doduc, su2cor, wave5)]
4. Add a Second-level Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Comparing Local and Global Miss Rates
- 32 KByte 1st level cache; increasing 2nd level cache
- Global miss rate close to single-level cache rate provided L2 >> L1
- Don't use the local miss rate
- L2 not tied to CPU clock
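The two-level AMAT equation above composes directly. A small helper with illustrative parameter values of my own (the slides give the formula, not these numbers):

```c
#include <assert.h>

/* AMAT for a two-level hierarchy:
 * AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2)
 * where mr2_local is the LOCAL L2 miss rate (misses in L2 per access to L2). */
double amat2(double hit1, double mr1,
             double hit2, double mr2_local, double pen2) {
    return hit1 + mr1 * (hit2 + mr2_local * pen2);
}
```

For example, with a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and 100-cycle L2 penalty, AMAT = 1 + 0.05 x (10 + 50) = 4 cycles; note the local L2 rate looks alarming while the global L2 miss rate is only 0.05 x 0.5 = 2.5%, which is why the slide warns against using the local rate alone.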
Reducing Misses: Which Apply to L2 Cache?
Reducing Miss Rate:
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
Relative CPU Time vs. L2 Block Size
[Chart: relative CPU time (1.0-1.9) vs. L2 cache block size; plotted values include 1.36, 1.28, and 1.27 for 16-, 32-, and 64-byte blocks; 32KB L1, 8-byte path to memory]
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Small and Simple Caches
Why does the Alpha 21164 have an 8KB instruction cache and 8KB data cache plus a 96KB second-level cache?
- Small data cache and clock rate
- Direct mapped, on chip
2. Avoiding Address Translation
Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. Physical Cache
- Every time a process is switched, we logically must flush the cache; otherwise we get false hits
  - Cost is time to flush + compulsory misses from the empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O must interact with the cache, so it needs virtual addresses
Solution to aliases:
- HW guarantees every cache block has a unique physical address
- SW guarantee: lower n bits must have the same address; as long as they cover the index field & direct mapped, they must be unique; called page coloring
Solution to cache flush:
- Add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if wrong process
Virtually Addressed Caches
[Figure: conventional organization: CPU sends VA to the TB (translation buffer), the resulting PA goes to the cache and MEM; virtually addressed organization: the cache is accessed with the VA and translation happens only on a miss; synonym problem illustrated with VA tags]

3. Pipelined Writes
- Pipeline Tag Check and Update Cache as separate stages; the current write does its tag check while the previous write updates the cache
- Only STORES are in the pipeline; it is empty during a miss
- Store r2, (r1): check r1 ...

Case Study: MIPS R4000
8 Stage Pipeline:
- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access
- IS: second half of access to instruction cache
- RF: instruction decode and register fetch, also instruction cache hit detection
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation
Case Study: MIPS R4000
[Figure: pipeline stage diagrams (IF IS RF EX DF DS TC WB) showing a TWO Cycle Load Latency and a THREE Cycle Branch Latency (conditions evaluated during EX phase)]
Delay slot plus two stalls
Branch likely cancels delay slot if not taken
R4000 Performance
Not ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled slots)
- FP result stalls: RAW data hazards (latency)
- FP structural stalls: not enough FP hardware (parallelism)
[Chart: CPI (0-4.5) for eqntott, espresso, gcc, li, doduc, nasa7, broken into Base, Load stalls, and Branch stalls components]
What is the Impact of What You've Learned About Caches?
1960-1985: Speed = f(no. operations)
1990s: pipelined execution & fast clock rate
[Chart: CPU vs. DRAM performance (100-1000, log scale); Processor-Memory Performance Gap grows 50% / year]
Cache Optimization Summary
[Table: techniques and their effect on miss rate: Larger Block Size, Higher Associativity, Victim Caches, Pseudo-Associative Caches, HW Prefetching of Instr/Data, Compiler Controlled Prefetching, Compiler Reduce Misses]
Cache Cross-Cutting Issues
- Superscalar CPU & number of cache ports must match: number of memory accesses/cycle
- Speculative execution and non-faulting option on memory/TLB
- Parallel execution vs. cache locality: want far separation to find independent operations vs. want reuse of data accesses to avoid misses
- I/O and consistency of data between cache and memory: caches => multiple copies of data
  - Consistency by HW or by SW?
  - Where to connect I/O to the computer?
Alpha Memory Performance: Miss Rates of SPEC
[Chart: miss rate (0.01%-100%, log scale) for AlphaSort, Espresso, Sc, Mdljsp2, Ear; annotations: I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; and I$ miss = 6%, D$ miss = 32%, L2 miss = 10%]
Alpha CPI Components
[Chart: CPI components (up to 5.00): instruction stall: branch mispredict (green); Data cache (blue); Instruction cache (yellow); L2$ (pink); Other: compute + reg conflicts, structural conflicts]
Predicting Cache Performance from Different Programs (ISA, compiler, ...)
- 4KB data cache: miss rate 8%, 12%, or 28%?
- 1KB instr cache: miss rate 0%, 3%, or 10%?
[Chart: miss rate (20%-35%) for the D$ on different programs, e.g. gcc]
Main Memory Background
Performance of main memory:
- Latency: cache miss penalty
  - Access Time: time between the request and the word arriving
  - Cycle Time: time between requests
- Bandwidth: I/O & large block miss penalty (L2)
Main memory is DRAM: Dynamic Random Access Memory
- Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
- Addresses divided into 2 halves (memory as a 2D matrix):
  - RAS or Row Access Strobe
  - CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
- No refresh (6 transistors/bit vs. 1 transistor/bit, area is 10X)
- Address not divided: full address
Size: DRAM/SRAM 4-8; Cost & cycle time: SRAM/DRAM 8-16
Main Memory Deep Background
- "Out-of-Core", "In-Core", ...?
- "Core memory"?
- Non-volatile, magnetic
- Lost to 4 Kbit DRAM (today: DRAM)
- Access time 750 ns, cycle time ...
DRAM Logical Organization (4 Mbit)
[Figure: 11 address lines feed a row decoder into the memory array; Column Decoder and Sense Amps & I/O produce the data D]
DRAM Physical Organization (4 Mbit)
[Figure: multiple blocks, each with its own Block Row Decoder; Row and Column Address route through I/O blocks]
4 Key DRAM Timing Parameters
- tRAC: minimum time from RAS line falling to valid data output.
  - Quoted as the speed of a DRAM when you buy it
  - A typical 4Mb DRAM has tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next.
  - tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS line falling to valid data output.
  - 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next.
  - 35 ns for a 4Mbit DRAM with a tRAC of 60 ns
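These cycle times translate directly into peak bandwidth. A sketch of the conversion (the per-access width is my assumption; the slide does not state the part's output width):

```c
#include <assert.h>

/* Peak bandwidth in MB/s for one access of `bytes_per_access` bytes
 * every `cycle_ns` nanoseconds:
 *   bytes * (1e9 ns/s) / cycle_ns / (1e6 B/MB) = bytes * 1000 / cycle_ns */
double peak_mb_per_s(double cycle_ns, double bytes_per_access) {
    return bytes_per_access * 1000.0 / cycle_ns;
}
```

Assuming a 4-byte transfer per access, back-to-back row accesses (tRC = 110 ns) sustain roughly 36 MB/s, while same-row column accesses (tPC = 35 ns) sustain about 114 MB/s, which is why page-mode operation matters.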
DRAM Performance
A 60 ns (tRAC) DRAM can
- perform a row access only every 110 ns (tRC)
- perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
- In practice, external address delays and turning around buses make it 40 to 50 ns
These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
DRAM History
- DRAMs: capacity +60%/yr, cost -30%/yr
- 2.5X cells/area, 1.5X die size in 3 years
- '98 DRAM fab line costs $2B
- DRAM only: density, leakage vs. speed
- Rely on increasing no. of computers & memory per computer (60% market)
- SIMM or DIMM is the replaceable unit
DRAM Future: 1 Gbit DRAM
- Mitsubishi: Blocks 512 x 2 Mb, Clock 200 MHz, Data Pins 64
Main Memory Performance
- Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits)
- Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
- Interleaved: CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
Main Memory Performance
Timing model (word size is 32 bits):
- 1 cycle to send the address, 6 cycles access time, 1 cycle to send data
- Cache block is 4 words
- Simple M.P.      = 4 x (1 + 6 + 1) = 32
- Wide M.P.        = 1 + 6 + 1 = 8
- Interleaved M.P. = 1 + 6 + 4 x 1 = 11
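The three miss penalties above follow from the same three timing terms. A sketch of the arithmetic (function names are my own):

```c
#include <assert.h>

/* Miss penalty in bus cycles for a `words`-word cache block, given
 * `addr` cycles to send the address, `access` cycles of memory access
 * time, and `xfer` cycles to return one word. */
int simple_penalty(int addr, int access, int xfer, int words) {
    return words * (addr + access + xfer);   /* one word at a time */
}
int wide_penalty(int addr, int access, int xfer) {
    return addr + access + xfer;             /* whole block at once */
}
int interleaved_penalty(int addr, int access, int xfer, int words) {
    return addr + access + words * xfer;     /* accesses overlap in banks */
}
```

With the slide's parameters (1, 6, 1, 4 words) these give 32, 8, and 11 cycles: interleaving overlaps the 6-cycle access times across banks and pays only the serialized word transfers.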
Independent Memory Banks
Memory banks for independent accesses vs. faster sequential accesses
- Multiprocessor
- I/O
- CPU with Hit under n Misses, Non-blocking Cache
Superbank: all memory active on one block transfer (or bank)
Independent Memory Banks
How many banks?
number of banks >= number of clocks to access a word in a bank
- For sequential accesses; otherwise we will return to the original bank before it has the next word ready
- (like in the vector case)
Main Memory Summary
Wider Memory
Interleaved Memory: for sequential orindependent accesses
Avoiding bank conflicts: SW & HW
DRAM specific optimizations: page mode &Specialty DRAM
DRAM future less rosy?
DRAM Cross-Cutting Issues
- After 20 years of 4X every 3 years, running into a wall? (64Mb - 1 Gb)
- How can we keep $1B fab lines full if we buy fewer DRAMs per computer?
- Cost/bit -30%/yr if we stop 4X?
- What will happen to the $40B/yr DRAM industry?
DRAMs per PC over Time
[Table: minimum memory size vs. DRAM generation ('86: 1 Mb, '89: 4 Mb, '92: 16 Mb, '96: 64 Mb, '99: 256 Mb, '02: 1 Gb); e.g. rows for 4 MB and 8 MB systems showing chips per PC (32, 16, 8, 4, 2)]
Virtual Memory
A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space.
All translations to primary and secondary addresses are handled transparently.
Basic Issues in VM System Design
- Size of information blocks that are transferred from secondary to main storage (M)
- If a block of information is brought into M, and M is full, then some region of M must be released to make room for the new block --> replacement policy
- Which region of M is to hold the new block --> placement policy
- Missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy

Paging Organization
Virtual and physical address spaces are partitioned into blocks of equal size:
- physical memory: page frames
- virtual memory: pages
[Figure: reg -> cache -> mem -> disk hierarchy, with frames in memory and pages on disk]
Addressing and Accessing a Two-Level Hierarchy
The computer system, HW or SW, must perform any address translation that is required.
Two ways of forming the address: segmentation and paging. Paging is more common. Sometimes the two are used together, one on top of the other. More about address translation later.
[Figure: system address -> memory management unit (MMU) with translation function (mapping tables, permissions, etc.) -> hit or miss]
Paging vs. Segmentation
[Figure: paging: the system address (block, word) indexes a lookup table and the block field is replaced directly; segmentation: the system address is looked up in a table to get a base address, which is added to the word offset]
Paging Organization
[Figure: address mapping with 1K pages; virtual addresses 0, 1024, ..., 31744 (pages) map through the Addr Trans MAP to physical addresses 0, 1024, ..., 7168 (frames 0-7 of Physical Memory); the V.A. splits into a page number and a 10-bit displacement to form the P.A.]
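The mapping above (1K pages, so a 10-bit displacement that is carried through unchanged) can be sketched as a lookup. The tiny page table is a hypothetical example of mine; a real table would also carry valid and protection bits:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 10u   /* 1K-byte pages, as on this slide */

/* Translate a virtual address through a flat page table that maps
 * virtual page numbers to physical frame numbers. */
uint32_t translate(const uint32_t *page_table, uint32_t va) {
    uint32_t vpn  = va >> PAGE_BITS;              /* virtual page number */
    uint32_t disp = va & ((1u << PAGE_BITS) - 1); /* displacement, unchanged */
    return (page_table[vpn] << PAGE_BITS) | disp;
}
```

Mapping virtual page 0 to frame 7 yields physical address 7168, matching the highest frame address in the figure.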
Segmentation Organization
- Notice that each segment's virtual address starts at 0, different from its physical address.
- Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.
- Compaction routines must occasionally be run to remove these fragments.
[Figure: virtual memory addresses (segments 1, 5, 6, 9, 3, each starting at 0) mapped into main memory at physical addresses 0000 onward, with gaps between segments]
Translation Lookaside Buffer
A way to speed up translation is to use a special cache of recently used page table entries; this has many names, but the most frequently used is Translation Lookaside Buffer or TLB
[Table: TLB entry: Virtual Address, Physical Address, Dirty, ...]
Really just a cache on the page table mappings
TLB access time is comparable to cache access time (much less than main memory access time)
Translation Lookaside Buffers
[Figure: VA -> TLB lookup -> PA on a hit; a miss falls back to translation]
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than 128-256 entries even on high end machines. This permits fully associative lookup on these machines.
Most mid-range machines use small n-way set associative organizations.
Address Translation
Page table is a large data structure ...
[Figure: CPU -> translation cache (VA -> PA); on a hit, data is returned directly]
Overlapped Cache & TLB Access
[Figure: 32-bit address split into a 20-bit page # and a 12-bit disp; the TLB does an assoc lookup on the page # in parallel with the cache access, which uses a 10-bit index into 1K sets of 4-byte entries; the TLB's PA is compared (=) against the cache tag for Hit/Miss]

IF cache hit AND (cache tag = PA) then deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] and TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation
Memory Hierarchy Summary
The memory hierarchy, from fast and expensive to slow and cheap: registers ... disk
- At first, consider just two adjacent levels in the hierarchy
The cache: high speed and expensive
- Direct mapped, associative, set associative
Virtual memory makes the hierarchy look big
- Translate the address from the CPU's logical address to the physical address where the information is actually stored
- The TLB helps in speeding the address translation
Memory management: how to move information back and forth
Practical Memory Hierarchy
- Issue is NOT inventing new mechanisms
- Issue is taste in selecting between many alternatives in putting together a memory hierarchy that fits well together
  - e.g., L1 data cache write-through, L2 write-back
TLB and Virtual Memory
Caches, TLBs, and virtual memory may all be understood by examining how they deal with four questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
- Page tables map virtual addresses to physical addresses
- TLBs make virtual memory practical
- Locality in data => locality in addresses of data, temporal and spatial
Alpha 21064
- Separate Instr & Data TLBs & Caches
- TLBs fully associative
- TLB updates in SW ("Priv Arch Libr")
- Caches 8KB direct mapped, write through
- Critical 8 bytes first
- Prefetch instr. stream buffer
- 2 MB L2 cache, direct mapped, WB (off-chip)
- 256 bit path to main memory, 4 x 64-bit modules
- Victim Buffer: to give read priority over write
- 4 entry write buffer between D$ & L2$
[Figure: instruction and data paths with Stream Buffer, Write Buffer, and Victim Buffer]