
CS6461 – Computer Architecture, Fall 2015
Morris Lancaster
Adapted from Professor Stephen Kaisler’s Notes

Lecture 4 – Memory Systems

(Some material extracted from slides by Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), and David Patterson (UCB))


The Ideal Memory

• Size: infinitely large
• Speed: infinitely fast, e.g., no latency
• Cost: free (well, infinitesimal)

• However, once reality sets in, we realize these features are mutually exclusive and not attainable with the technology available today.

• Tomorrow, however, is another story.


Memory Hierarchy

(Figure: Processor → L1$ → L2$ → Main Memory → Secondary Memory, with access time and the (relative) size of the memory increasing with distance from the processor.)

• Typical transfer units grow with distance: 4-64 bytes (a word) between the processor and L1$; 16-128 bytes (a block) between L1$ and L2$; 1 to 8 blocks between L2$ and main memory; 1,024 bytes – 4 MBytes (a disk sector = page) between main memory and secondary memory.
• Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory (usually disk).
• We show only two levels of cache here, but as we will see in later lectures, some processors have three levels of cache.
• Note: page sizes may be much larger, e.g., 64 KBytes or even 1 MByte.


Memory Hierarchy

• Memory is much slower than the processor.
• Faster memories are more expensive.
• Due to high decoding time and other reasons, larger memories are always slower.
  – Therefore, locate a small but very fast memory (SRAM: the L1 cache) very close to the processor.
  – The L2 cache will be larger and slower, between L1 and the main memory.
  – Many processors now have an L3 cache (see Multicores).
  – Main memory is usually GBs large and is made up of DRAMs, with access times of tens of nanoseconds.
• Secondary memory is on disks and similar devices (hard disk, CD, DVD, Flash, SSD), is hundreds of GBs to TBs, and takes microseconds (SSD) to milliseconds (hard disk) to access.
• CPU registers are the closest to the CPU, but do not use memory addresses – they have separate identifiers.


Memory History - 0

See @ Computer History Museum, Mountain View, CA


Memory History - I

Delay-line memory: information introduced to the memory in the form of electric pulses was transduced into mechanical waves that propagated relatively slowly through a medium.

A relay is an electrically operated switch. Many relays use an electromagnet to operate a switching mechanism mechanically.


Memory History - II

A drum is a large metal cylinder that is coated on the outside surface with a ferromagnetic recording material.

A vacuum tube is a glass enclosure inside which are an anode, a cathode, and other elements such as filaments; the tube is evacuated.


Memory History - III

The Williams tube depends on an effect called secondary emission. When a dot is drawn on a cathode ray tube, the area of the dot becomes slightly positively charged and the area immediately around it becomes slightly negatively charged, creating a charge well. The charge well remains on the surface of the tube for a fraction of a second, allowing the device to act as a computer memory. The lifetime of the charge well depends on the electrical resistance of the inside of the tube.

Magnetic cores are little ‘donuts’ with three wires passing through them: two select the core by its x and y coordinates, while the third senses or sets the state stored on the core.

Who invented magnetic cores?


Core Memory

• Core memory was the first large-scale, reliable main memory.

• Invented by Forrester in the late 1940s/early 1950s at MIT for the Whirlwind project.

• Bits were stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires.

• Coincident current pulses on the X and Y wires would write a cell and also sense the original state (destructive reads).

• Robust, non-volatile storage.

• Core access time ~1 µs.


Semiconductor (various types)

• DRAM: Dynamic Random Access Memory
  – DRAM needs its cells recharged (refreshed) every few milliseconds.
• SRAM: Static Random Access Memory
  – SRAM does not need recharging, since each cell operates on a principle of current switched in one of two directions rather than a storage cell which holds a charge in place.


Modern DRAM


Parity versus Non-Parity

• Parity is an error-detection scheme:
  – developed to detect errors in data – principally over communications lines, but also applied to storage;
  – by adding a single bit to each byte of data, one can check the integrity of the other 8 bits while the data is transmitted or moved from storage.
• This led to error-correcting codes, which are a topic in themselves.
• Today, memory errors are rare because of the very high quality of the manufacturing process, so most memory is non-parity.
• Parity is still used in some mission-critical systems (a parity sketch follows).
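As an illustration (a minimal sketch, not from the original slides), an even-parity bit for one byte can be computed and checked like this:

```c
#include <stdint.h>
#include <stdio.h>

/* Compute the even-parity bit for one byte: chosen so the total number
   of 1s across the 9 bits (data + parity) is even. */
static uint8_t parity_bit(uint8_t data) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (data >> i) & 1;            /* XOR of all data bits */
    return p;
}

int main(void) {
    uint8_t byte = 0x5A;                 /* 01011010: four 1s, parity = 0 */
    uint8_t stored_parity = parity_bit(byte);
    uint8_t received = byte ^ 0x08;      /* flip one bit to simulate an error */
    if (parity_bit(received) != stored_parity)
        printf("single-bit error detected\n");
    return 0;
}
```

Note that parity detects any odd number of flipped bits but cannot say which bit flipped; locating and repairing the error is what error-correcting codes add.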


Semiconductor Memory Evolution

Year   Technology                   Clock rate
1970   RAM                          4.77 MHz
1987   Fast-Page Mode DRAM          20 MHz
1995   Extended Data Output         20 MHz
1997   PC66 Synchronous DRAM        66 MHz
1998   PC100 Synchronous DRAM       100 MHz
1999   Rambus DRAM                  800 MHz
1999   PC133 Synchronous DRAM       133 MHz
2000   DDR Synchronous DRAM         266 MHz
2002   Enhanced DRAM                450 MHz
2005   DDR2                         660 MHz
2009   DDR3                         800 MHz

And so on …


Memory Interleaving/Banking

• Memory interleaving divides memory into banks (figure not reproduced).
  – Addresses are distributed across the banks – 4-way interleaving was depicted.
  – Interleaving allows simultaneous access to words in memory if the words are in separate banks (the address-to-bank mapping is sketched below).
  – As we will see, this may conflict with caching.
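A minimal sketch (assuming 4-way low-order interleaving of word addresses; not code from the slides) of how an address maps to a bank and an offset within the bank:

```c
#include <stdio.h>

#define NUM_BANKS 4   /* 4-way interleaving (assumed for illustration) */

int main(void) {
    /* With low-order interleaving, consecutive word addresses fall in
       consecutive banks, so they can be accessed in parallel. */
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned bank   = addr % NUM_BANKS;  /* which bank holds the word   */
        unsigned offset = addr / NUM_BANKS;  /* word's position in the bank */
        printf("address %u -> bank %u, offset %u\n", addr, bank, offset);
    }
    return 0;
}
```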


Basic Memory

• You can think of computer memory as being one big array of data.
  – The address serves as an array index.
  – Each address refers to one word of data.
• You can read or modify the data at any given memory address, just like you can read or modify the contents of an array at any given index.
• If you’ve worked with pointers in C or C++, then you’ve already worked with memory addresses (see the short example below).
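For instance, a minimal C sketch of treating an address as an array index:

```c
#include <stdio.h>

int main(void) {
    int words[4] = {10, 20, 30, 40};  /* a tiny "memory" of four words   */
    int *addr = &words[2];            /* a pointer holds a memory address */

    printf("value at address %p = %d\n", (void *)addr, *addr);  /* read  */
    *addr = 99;                                                 /* write */
    printf("words[2] is now %d\n", words[2]);
    return 0;
}
```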

(Figure: a 2^k x n memory block with inputs ADRS (k bits), DATA (n bits), CS, and WR, and output OUT (n bits).)

CS   WR   Memory operation
0    x    None
1    0    Read selected word
1    1    Write selected word


Basic Memory - II

• The above depicts the main interface to RAM (modeled in the sketch below):
  – A Chip Select, CS, enables or disables the RAM.
  – ADRS specifies the address or location to read from or write to.
  – WR selects between reading from or writing to the memory.
• To read from memory, WR should be set to 0.
  – OUT will be the n-bit value stored at ADRS.
• To write to memory, we set WR = 1.
  – DATA is the n-bit value to save in memory.
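A minimal behavioral model of this interface (a sketch for illustration; k = 4 and n = 16 are chosen arbitrarily):

```c
#include <stdint.h>
#include <stdio.h>

#define K     4              /* address bits */
#define WORDS (1u << K)      /* 2^k words    */

static uint16_t mem[WORDS];  /* 2^k x 16-bit memory array */

/* One access following the CS/WR truth table above; returns OUT. */
uint16_t ram_access(int cs, int wr, unsigned adrs, uint16_t data) {
    if (!cs) return 0;                           /* CS=0: no operation      */
    if (wr)  { mem[adrs] = data; return data; }  /* CS=1, WR=1: write word  */
    return mem[adrs];                            /* CS=1, WR=0: read word   */
}

int main(void) {
    ram_access(1, 1, 0x3, 0xBEEF);                       /* write address 3 */
    printf("OUT = 0x%04X\n", ram_access(1, 0, 0x3, 0));  /* read it back    */
    return 0;
}
```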


Basic Memory - III

(Figure: a memory array of one-bit cells organized as 2^N rows × 2^M columns. The N+M address bits are split: N bits feed a row address decoder driving the word lines, and M bits feed a column decoder & sense amplifiers attached to the bit lines; a single data bit D is read or written.)

Bits are stored in 2-dimensional arrays on the chip; modern chips have around 4 logical banks on each chip. (The row/column address split is sketched below.)
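A small sketch (illustrative, not from the slides) of how an N+M-bit address splits into a row and a column:

```c
#include <stdio.h>

#define N 3   /* row address bits    (2^N rows)    -- assumed for illustration */
#define M 2   /* column address bits (2^M columns) */

int main(void) {
    unsigned addr = 0x15;                  /* 10101: a 5-bit address, N+M = 5 */
    unsigned row = addr >> M;              /* top N bits select the word line */
    unsigned col = addr & ((1u << M) - 1); /* low M bits select the bit line  */
    printf("address 0x%X -> row %u, column %u\n", addr, row, col);
    return 0;
}
```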


DRAM Packaging

• A DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (buffers are sometimes needed to drive the signals to all chips).

• The data pins work together to return a wide word (e.g., a 64-bit data bus built from sixteen 4-bit parts).


Memory System Design: Key Ideas

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Instructions and data both exhibit spatial and temporal locality.
  – Temporal locality: if a particular instruction or data item is used now, there is a good chance that it will be used again in the near future.
  – Spatial locality: if a particular instruction or data item is used now, there is a good chance that the instructions or data items located in memory immediately following or preceding it will soon be used.
• Therefore, it is a good idea to move instruction and data items that are expected to be used soon from slow memory to fast memory (cache); the loop below illustrates both kinds of locality.
• BUT! This is prediction, and therefore will not always be correct – it depends on the extent of locality.
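A tiny example (illustrative only):

```c
/* Summing an array shows both kinds of locality. */
long sum(const int *a, int n) {
    long s = 0;        /* s and i are reused every iteration: temporal locality */
    for (int i = 0; i < n; i++)
        s += a[i];     /* a[0], a[1], ... are adjacent in memory: spatial locality */
    return s;
}
```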


Simple Example

Simple calculation assuming just the application program:
• Assume a 1 GHz processor (1 ns cycle time) using 10 ns memory.
  – 35% of all executed instructions are loads or stores.
  – The application runs 1 billion instructions.
• Straight memory execution time = (1×10^9 + 0.35×10×10^9) × 10^-9 = 4.5 s.
• Assume all instructions and data that are required are stored in a perfect cache that operates within the clock period.
  – Execution time with a perfect cache = 1 s.
• Now, assume that the cache has a hit rate of 90%.
  – Execution time with cache = (1×10^9 + 0.35×0.1×10×10^9) × 10^-9 = 1.35 s (the 0.1 comes from the 10% cache misses).
• Caches are 95-99% successful in having the required instructions and 75-90% successful for data. (The sketch below reproduces the arithmetic.)
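The same calculation as a small C sketch (numbers from the slide; the variable names are mine):

```c
#include <stdio.h>

int main(void) {
    double insts       = 1e9;    /* instructions executed        */
    double cycle       = 1e-9;   /* 1 GHz -> 1 ns per cycle      */
    double mem_cycles  = 10;     /* 10 ns memory = 10 cycles     */
    double mem_op_frac = 0.35;   /* loads/stores per instruction */
    double miss_rate   = 0.10;   /* 90% hit rate                 */

    printf("straight memory: %.2f s\n",
           (insts + mem_op_frac * insts * mem_cycles) * cycle);             /* 4.50 s */
    printf("perfect cache:   %.2f s\n", insts * cycle);                     /* 1.00 s */
    printf("90%% hit rate:    %.2f s\n",
           (insts + mem_op_frac * insts * miss_rate * mem_cycles) * cycle); /* 1.35 s */
    return 0;
}
```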


Cache - I

• A cache is a small (size << main memory), fast memory that temporarily holds data and instructions and makes them available to the processor much faster than main memory.
• Cache space (~MBytes) is smaller than main memory (~GBytes).
• Why do caches succeed in improving performance? LOCALITY!
• Hit: the data appears in some block in the upper level (example: Block X).
  – Hit Rate: the fraction of memory accesses found in the upper level.
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: the data needs to be retrieved from a block in the lower level (Block Y).
  – Miss Rate = 1 - (Hit Rate).
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit Time << Miss Penalty (500 instructions on the Alpha 21264!)


Cache - II


Cache - III

Cache Algorithm (READ)


Cache Performance

• Average memory access time = hit time + miss rate × miss penalty (a one-line helper appears below).
• To improve performance, reduce memory access time:
  => we need to reduce hit time, miss rate & miss penalty.
• As L1 caches are in the critical path of instruction execution, hit time is the most important parameter.
• When one parameter is improved, others might suffer.
• Misses:
  – Compulsory miss: always occurs the first time.
  – Capacity miss: reduced by increasing the cache size.
  – Conflict miss: reduced by increasing the level of associativity.
• Types:
  – Instruction or Data Cache: 1-way or 2-way
  – Data Cache: write-through & write-back
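A minimal helper for the formula above (a sketch; all times in cycles):

```c
/* AMAT = hit_time + miss_rate * miss_penalty, all in cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
/* e.g., amat(1, 0.05, 100) = 6.0 cycles per access on average */
```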


Cache Topology

• Determines the number and interconnection of caches.
• Early caches were focused on instructions, while data were fetched directly from memory.
• Then, unified caches holding both instructions and data were used.
• Then, split caches – one for instructions and one for data.
• Today, either unified or split is used, depending on the processor’s use.


Split vs. Unified Caches

• Advantages of unified caches:
  – They balance the load between instruction and data fetches depending on the dynamics of the program execution;
  – Design and implementation are cheaper.
• Advantages of split caches (Harvard architectures):
  – Competition for the cache between instruction processing (which fetches from the instruction cache) and the execution functional units (which fetch from the data cache) is eliminated;
  – Instruction fetch can proceed in parallel with memory access from the execution unit.


Cache Writing

• A Write Buffer is needed between the Cache and Memory.
  – Processor: writes data into the cache and the write buffer.
  – Memory controller: writes the contents of the buffer to memory.
• The write buffer is just a FIFO (sketched below):
  – Typical number of entries: 4.
  – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
• The memory system designer’s nightmare:
  – Store frequency (w.r.t. time) > 1 / DRAM write cycle.
  – Write buffer saturation.

(Figure: Processor → Cache and Write Buffer → DRAM.)
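A minimal sketch (illustrative; a real buffer holds address/data pairs and is drained by the memory controller) of a 4-entry FIFO write buffer:

```c
#include <stdint.h>

#define WB_ENTRIES 4            /* typical number of entries */

struct wb_entry { uint32_t addr; uint32_t data; };

static struct wb_entry buf[WB_ENTRIES];
static int head, tail, count;   /* FIFO state */

/* Processor side: returns 0 (CPU must stall) if the buffer is saturated. */
int wb_push(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return 0;        /* write buffer saturation */
    buf[tail] = (struct wb_entry){addr, data};
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return 1;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
int wb_pop(struct wb_entry *out) {
    if (count == 0) return 0;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return 1;
}
```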


Cache Writing - II

Q. Why a write buffer?
A. So the CPU doesn’t stall.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Either drain the buffer before the next read, or check the write buffer contents first and, if there is no conflict, send the read ahead of the buffered writes.

Note: We will discuss RAW hazards in a future lecture.


Cache Writing - III

• Write: we need to update the upper cache(s) and main memory whenever a store instruction modifies the L1 cache.
  – Write Hit: the item to be modified is in L1.
    - Write Through: as if there were no L1, write also to L2.
    - Write Back: set a Dirty Bit, and update L2 only when the block is replaced.
  – Although write through is an inefficient strategy, most L1s and some upper-level caches follow this approach so that read hit time is not affected by the complicated logic needed to update the dirty bit.
• Write Miss: the item to be modified is not in L1 (the policy combinations are sketched below).
  – Write allocate: exploit locality, and bring the block into L1.
  – Write no-allocate: do not fetch the missing block.


Cache Writing - IV


Cache Organization: Summary


Direct-Mapped Cache

• A block can be placed in one location only, given by:

(Block address) MOD (Number of blocks in cache)


Direct Mapped Cache - II

(Figure: a direct-mapped cache with 8 block frames (indices 000–111) in front of a memory of 32 cacheable blocks (addresses 00000–11111).)

A block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache). In this case: (Block address) MOD (8), i.e., the three low-order bits of the block address. For example, in binary, (11101) MOD (1000) = 101, so memory block 11101 maps to cache frame 101.


Direct Mapped Cache - III

• A memory block is mapped into a unique cache line, depending on the memory address of the respective block.
• A memory address is considered to be composed of three fields:
  1. The least significant bits (2 in our example) identify the byte within the block [assume four bytes/block].
  2. The rest of the address (22 bits in our example) identifies the block in main memory; for the cache logic, this part is interpreted as two fields:
     2a. the least significant bits (14 in our example) specify the cache line;
     2b. the most significant bits (8 in our example) represent the tag, which is stored in the cache together with the line.
• Tags are stored in the cache in order to distinguish among blocks which fit into the same cache line (the field extraction is sketched below).
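A sketch of the field extraction for these example widths (2 offset bits, 14 index bits, 8 tag bits, i.e., a 24-bit address):

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2    /* byte within the 4-byte block             */
#define INDEX_BITS  14   /* cache line (2^14 lines in this example)  */

int main(void) {
    uint32_t addr   = 0xABCDEF;   /* a 24-bit address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);   /* top 8 bits */
    printf("tag=0x%02X index=0x%04X offset=%u\n", tag, index, offset);
    return 0;
}
```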


Direct Mapped Cache - IV

• Advantages:
  – Simple and cheap;
  – The tag field is short: only those bits which are not used to address the cache have to be stored (compare with the following approaches);
  – Access is very fast.
• Disadvantages:
  – A given block fits into a fixed cache location.
  – A given cache line will be replaced whenever there is a reference to another memory block which maps to the same line, regardless of the status of the other cache lines.
  – This can produce a low hit ratio, even if only a very small part of the cache is effectively used.


2-Way Associative Cache

• A block can be placed in a restricted set of places, or cache block frames.
  – A set is a group of block frames in the cache.
  – A block is first mapped onto the set, and then it can be placed anywhere within the set.
  – The set in this case is chosen by: (Block address) MOD (Number of sets in cache), as in the sketch below.
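A small sketch of the set selection (illustrative sizes):

```c
#define WAYS       2
#define NUM_BLOCKS 8
#define NUM_SETS   (NUM_BLOCKS / WAYS)   /* 4 sets of 2 frames each */

/* The block may go in either of the WAYS frames of this set. */
unsigned set_index(unsigned block_addr) {
    return block_addr % NUM_SETS;
}
```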


Two-Way Set Associative Cache - II

• Set 0 can be occupied by data from:
  – memory locations 0, 2, 4, 6, 8, ... etc.;
  – in general: any memory location whose address LSB is 0;
  – Address<0> => cache set index.
• On a miss, the block will be placed in one of the two cache lines belonging to the set which corresponds to the 13-bit index field in the memory address.
• The replacement algorithm decides which line to use.
• A memory block is mapped into any of the lines of a set.
  – The set is determined by the memory address, but the line inside the set can be any one.


Two-Way, Set Associative Cache - III

• Several tags (corresponding to all lines in the set) have to be checked in order to determine if we have a hit or a miss. If we have a hit, the cache logic finally points to the actual line in the cache.
• The number of lines in a set is determined by the designer:
  – 2 lines/set: two-way set associative mapping;
  – 4 lines/set: four-way set associative mapping.
• Set associative mapping keeps most of the advantages of direct mapping:
  – short tag field;
  – fast access;
  – relatively simple.


Two-Way Set Associative Cache - IV

• Set associative mapping tries to eliminate the main shortcoming of direct mapping:
  – a certain flexibility is given concerning the line to be replaced when a new block is read into the cache.
• Cache hardware is more complex for set associative mapping than for direct mapping.
• In practice, 2- and 4-way set associative mapping are used with very good results.
• Larger sets do not produce further significant performance improvement.
  – Interesting thesis topic: is this true for multicore architectures?


Fully Associative Cache

• A block can be placed anywhere in the cache.
• Lookup hardware for many tags can be large and slow.


Cache Replacement Strategy

• Replacing a block on a cache miss?
  – Easy for Direct Mapped.
  – Set Associative or Fully Associative:
    • Random
    • LRU (Least Recently Used)
    • FIFO (First In First Out)
    • LFU (Least Frequently Used)
• LRU is the most efficient: relatively simple to implement and good results (a sketch follows).
• FIFO is simple to implement.
• Random replacement is the simplest to implement, and the results are surprisingly good.
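A minimal sketch (illustrative, not from the slides) of LRU within one set, using an age counter per line:

```c
#define WAYS 4

struct line { unsigned tag; int valid; unsigned age; };

/* Pick the victim within a set: an invalid line if any, else the oldest. */
int lru_victim(struct line set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) return i;              /* free frame: use it */
        if (set[i].age > set[victim].age) victim = i;
    }
    return victim;
}

/* On every access, the touched line becomes the youngest; the rest age. */
void lru_touch(struct line set[WAYS], int used) {
    for (int i = 0; i < WAYS; i++) set[i].age++;
    set[used].age = 0;
}
```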


Comparison by Size

• Data miss rates by cache size, associativity, and replacement policy:

          2-way            4-way            8-way
Size      LRU     Random   LRU     Random   LRU     Random
16 KB     5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB     1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB    1.15%   1.17%    1.13%   1.13%    1.12%   1.12%

• So, what does this tell us?
• Once the cache is large, the choice of replacement policy (and of higher associativity) buys little: at 256 KB, LRU and Random are practically indistinguishable.


Cache Performance Revisited

• Suppose a processor executes at:
  – Clock Rate = 1000 MHz (1 ns per cycle)
  – CPI = 1.0
  – 50% arithmetic/logic, 30% load/store, 20% control
• Suppose that 10% of memory operations get a 100-cycle miss penalty.
• CPI = ideal CPI + average stall cycles per instruction
      = 1.0 (cycle) + (0.30 (data operations/instruction) × 0.10 (misses/data operation) × 100 (cycles/miss))
      = 1.0 cycle + 3.0 cycles = 4.0 cycles
• 75% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 1.0 cycle to the CPI!


Lower Miss Rate: 16 KB D/I or 32 KB Unified Cache?

• Assume a hit takes 1 clock cycle and the miss penalty is 50 cycles.
• Assume a load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.
• Assume 75% of memory accesses are instruction references.
• Overall miss rate for the split cache (0.64% for the instruction cache, 6.47% for the data cache):
  – (75% × 0.64%) + (25% × 6.47%) = 2.10%
• Average memory access time (split):
  – = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 0.990 + 1.059 = 2.05
• Average memory access time (unified, miss rate 1.99%):
  – = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 1.496 + 0.749 = 2.24


Cache Addressing

• Access caches by virtual address or physical address


Cache Addressing

• Physical or virtual address – for the index, the tag, or both:

Tag \ Index   Physical                            Virtual
Physical      PP: must translate address first    PV: operates concurrently w/ MMU
Virtual       VP: must translate address first;   VV: very fast; MMU checks tag
              for complex, unusual systems


Summary: Cache Issues

• How many caches?
  – Write-Through or Write-Back?
  – Direct-Mapped or Set Associative?
• How do we determine on a read whether we have a miss or a hit?
• If there is a miss and there is no place for a new slot in the cache, which information should be replaced?
• How do we preserve consistency between cache and main memory on a write?
• Replacement strategy?


Cache Optimizations

Henk Corporaal, www.ics.ele.tue.nl/~heco/courses/aca – good topics for a term paper!

• Reducing hit time:
  – Small and simple caches
  – Way prediction
  – Trace caches
• Increasing cache bandwidth:
  – Pipelined caches
  – Multibanked caches
  – Nonblocking caches
• Reducing miss penalty:
  – Critical word first
  – Merging write buffers
• Reducing miss rate:
  – Compiler optimizations
• Reducing miss penalty or miss rate via parallelism:
  – Hardware prefetching
  – Compiler prefetching


Cache Optimizations - I

1. Fast Hit via Small and Simple Caches
• Indexing the tag memory and then comparing takes time.
• A small cache is faster.
  – An L2 cache small enough to fit on the chip with the processor also avoids the time penalty of going off chip.
• Simple direct mapping:
  – The tag check can overlap with data transmission, since there is no choice of block.


Cache Optimizations - II

2. Fast Hit via Way Prediction
• Goal: make set-associative caches faster.
• Keep extra bits in the cache to predict the “way,” or block within the set, of the next cache access.
  – The multiplexor is set early to select the desired block; only 1 tag comparison is performed.
  – On a prediction miss, check the other blocks for matches in the next clock cycle.
• Accuracy ≈ 85%.
• Drawback: a hit may now take 1 or 2 cycles, which complicates the CPU pipeline.


Cache Optimizations - III

3. Fast Hit via Trace Cache
• Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line.
  – A single fetch brings in multiple basic blocks.
  – The trace cache is indexed by the start address and the next n branch predictions.
+ Better utilization of long lines (don’t exit in the middle of a block, don’t enter at a label in the middle of a block).
- Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size.
- Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes.

(Figure of basic blocks separated by branches (BR) omitted.)


Cache Optimizations - IV

4. Increase Cache Bandwidth by Pipelining
• Pipeline cache accesses to maintain bandwidth, at the cost of higher latency.
• Number of instruction-cache access pipeline stages:
  – 1: Pentium
  – 2: Pentium Pro through Pentium III
  – 4: Pentium 4
• Costs:
  – greater penalty on mispredicted branches;
  – more clock cycles between the issue of a load and the use of its data.


Cache Optimizations - V

5. Increasing Cache Bandwidth: Non-Blocking (Lockup-Free) Caches
• Allow the data cache to continue to supply cache hits during a miss.
  – Requires an out-of-order execution CPU.
• “Hit under miss” reduces the effective miss penalty by continuing to work during a miss.
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses.
  – Requires that the memory system can service multiple misses.
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
  – Requires multiple memory banks (otherwise it cannot be supported).
  – The Pentium Pro allows 4 outstanding memory misses.


Cache Optimizations - VI

6. Increase Cache Bandwidth via Multiple Banks
• Divide the cache into independent banks that can support simultaneous accesses.
  – E.g., the T1 (“Niagara”) L2 has 4 banks.
• Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system.
• A simple mapping that works well is “sequential interleaving”:
  – Spread block addresses sequentially across the banks.
  – E.g., with 4 banks, bank 0 has all blocks whose address % 4 = 0, bank 1 has all blocks whose address % 4 = 1, …


Cache Optimizations - VII

7. Early Restart / Critical Word First to Reduce Miss Penalty
• Don’t wait for the full block to be loaded before restarting the CPU.
  – Early restart: as soon as the requested word of the block arrives, send it to the CPU and continue.
  – Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block.
• Generally useful only when blocks are large.


Cache Optimizations - VIII

8. Merging Write Buffer to Reduce Miss Penalty
• The write buffer allows the processor to continue while waiting for writes to memory to complete.
• Writes to the same block are merged: e.g., four writes are merged into one buffer entry rather than occupying separate entries.
• The result is less frequent write-backs and better use of the buffer’s capacity.


Cache Optimizations - IX

9. Reducing Misses by Compiler Optimizations
Instructions:
  – Reorder procedures in memory so as to reduce conflict misses.
  – Use profiling to look at conflicts (using developed tools).
Data (each transformation is sketched on the following slides):
  – Merging arrays: improve spatial locality with a single array of compound elements vs. two separate arrays.
  – Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
  – Loop fusion: combine two independent loops that have the same looping and some variables in common.
  – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows.


Example: Merging Arrays
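The slide’s figure is not reproduced here; the following is a standard sketch of the transformation (names and size are illustrative):

```c
#define SIZE 1000   /* illustrative */

/* Before: two parallel arrays; val[i] and key[i] may live in
   different cache blocks. */
int val[SIZE];
int key[SIZE];

/* After: one array of compound elements; val and key for the same i
   share a cache block, improving spatial locality. */
struct merged {
    int val;
    int key;
};
struct merged merged_array[SIZE];
```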


Example: Loop Interchange
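The slide’s figure is not reproduced here; a standard sketch of the transformation (sizes illustrative):

```c
#define N 100
double x[N][N];

/* Before: the inner loop walks down a column, striding N doubles per
   access -- poor spatial locality for a row-major C array. */
void interchange_before(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: loops interchanged so accesses follow storage order. */
void interchange_after(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}
```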


Example: Loop Fusion
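The slide’s figure is not reproduced here; a standard sketch of the transformation (sizes illustrative):

```c
#define N 100
double a[N][N], b[N][N], c[N][N], d[N][N];

/* Before: two separate passes; a and c are fetched twice. */
void fusion_before(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

/* After: fused, so a[i][j] and c[i][j] are still in the cache
   when the second statement uses them. */
void fusion_after(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}
```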


Blocking Applied to Array Multiplication
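The slide’s figure is not reproduced here; the usual starting point is the straightforward x = y × z multiply, which sweeps whole rows and columns and gets little cache reuse for large N:

```c
#define N 480   /* illustrative; a multiple of the blocking factor used below */
double x[N][N], y[N][N], z[N][N];

void matmul_naive(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];   /* full row of y, full column of z */
            x[i][j] = r;
        }
}
```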


Blocking Applied to Array Multiplication
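And the blocked version (a standard sketch; the blocking factor B is chosen so a B×B submatrix fits in the cache, and N is assumed to be a multiple of B):

```c
#define N 480
#define B 24    /* blocking factor; see the Lam et al. result on the next slide */
double x[N][N], y[N][N], z[N][N];   /* x must be zeroed before the call */

void matmul_blocked(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];   /* reuses a BxB block of z */
                    x[i][j] += r;
                }
}
```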


Blocking Applied to Array Multiplication

• Conflict misses in caches vs. blocking size:
  – Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, despite both fitting in the cache.


Cache Optimizations - X

10. Reducing Cache Misses by Hardware Prefetching
• Uses extra memory bandwidth (if available).
• Instruction prefetching:
  – Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block.
  – The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer.
• Data prefetching:
  – The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages.
  – Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is < 256 bytes.


Cache Optimizations - XI

11. Reducing Cache Misses by Software-Controlled Prefetching
• Data prefetch:
  – Load data into a register (HP PA-RISC loads).
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9).
  – Special prefetching instructions cannot cause faults; this is a form of speculative execution (see the sketch below).
• Issuing prefetch instructions takes time:
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – A wider superscalar reduces the difficulty of finding issue bandwidth.
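As one concrete illustration (GCC/Clang’s __builtin_prefetch, not mentioned on the slide), a compiler or programmer can issue non-faulting prefetches a fixed distance ahead of use:

```c
/* The distance of 16 iterations is an assumption for illustration;
   in practice it would be tuned to the miss latency. */
void scale(double *a, int n, double s) {
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality;
                                                  does not fault past the array */
        a[i] *= s;
    }
}
```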


Cache Optimizations - Summary