Appendix C: Memory Hierarchy
Why care about memory hierarchies?
[Figure: processor vs. memory performance, 1980-2010, log scale. The processor-memory performance gap keeps growing.]
Major source of stall cycles: memory accesses
Levels of the Memory Hierarchy
Upper levels are faster; lower levels are larger.
- CPU registers: 100s of bytes, <0.5 ns access
- Cache: KBytes, 1 ns, 1-0.1 cents/bit
- Main memory: MBytes, 100 ns, 0.0001-0.00001 cents/bit
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit
- Tape: infinite capacity, sec-min access, 10^-8 cents/bit
Staging / transfer units between levels:
- Registers <-> cache: instruction operands (1-8 bytes), managed by the program/compiler
- Cache <-> memory: blocks (8-128 bytes), managed by the cache controller
- Memory <-> disk: pages (512 B-4 KB), managed by the OS
- Disk <-> tape: files (MBytes), managed by the user/operator
Motivating memory hierarchies
Two structures hold data:
- Registers: a small array of storage
- Memory: a large array of storage
What characteristics would we like memory to have?
- High capacity
- Low latency
- Low cost
We can't satisfy all three requirements with one memory technology.
Memory hierarchy
Solution: use a little bit of everything!
- Small SRAM array (cache): small means fast and cheap
- Larger DRAM array (main memory): hope you rarely have to use it
- Extremely large disk: costs are decreasing at a faster rate than we fill them
Terminology
- Hit: the data you want is found at a given level
- Miss: the data is not present at that level; in this case, check the next lower level
- Hit rate: the fraction of accesses that hit at a given level; miss rate = (1 - hit rate)
- Another performance measure, average memory access time:
  AMAT = (hit time) + (miss rate) x (miss penalty)
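To make the AMAT formula concrete, here is a minimal Python sketch (function and variable names are ours, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles: the hit time plus the
    miss rate weighted by the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Example: 1-cycle hit, 5% miss rate, 100-cycle miss penalty -> 6.0 cycles
print(amat(1, 0.05, 100))
```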
Memory hierarchy operation
- We'd like most accesses to use the cache, the fastest level of the hierarchy
- But the cache is much smaller than the address space
- Even so, most caches have a hit rate > 80%
- How is that possible? The cache holds the data most likely to be accessed
Principle of locality
Programs don't access data randomly; they display locality in two forms:
- Temporal locality: if you access a memory location (e.g., 1000), you are more likely to re-access that location than some random location
- Spatial locality: if you access a memory location (e.g., 1000), you are more likely to access a location near it (e.g., 1001) than some random location
Cache Basics
A cache is a fast (but small) memory close to the processor. When data is referenced:
- If it is in the cache, use the cache instead of memory
- If not, bring it into the cache (actually, bring in the entire containing block)
- We may have to kick something else out to do it!
Important decisions:
- Placement: where in the cache can a block go?
- Identification: how do we find a block in the cache?
- Replacement: what do we kick out to make room?
- Write policy: what do we do about stores?
4 Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
Q1: Cache Placement
Placement: which memory blocks are allowed into which cache lines.
Placement policies:
- Direct mapped: a block can go to only one line
- Fully associative: a block can go to any line
- Set-associative: a block can go to one of N lines
  - E.g., if N = 4, the cache is 4-way set associative
  - The other two policies are extremes of this (e.g., N = 1 gives a direct-mapped cache)
Q1: Block placement example
Where can memory block 12 go in an 8-block cache?
(Set-associative mapping: set = block number mod number of sets.)
- Fully associative: any of the 8 lines
- Direct mapped: line (12 mod 8) = 4
- 2-way set associative: set (12 mod 4) = 0, i.e., either line of set 0
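The same mapping can be expressed as a short Python sketch (our own illustration, not from the slides; the lines of a set are assumed contiguous):

```python
def candidate_lines(block, num_lines, ways):
    """Cache lines a memory block may occupy for a given associativity.
    ways=1 -> direct mapped; ways=num_lines -> fully associative."""
    num_sets = num_lines // ways
    s = block % num_sets                      # set index
    return list(range(s * ways, (s + 1) * ways))

print(candidate_lines(12, 8, 1))   # direct mapped: [4]
print(candidate_lines(12, 8, 2))   # 2-way: set 0 -> [0, 1]
print(candidate_lines(12, 8, 8))   # fully associative: [0..7]
```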
Q2: Cache Identification
When an address is referenced, we need to:
- Find whether its data is in the cache
- If it is, find where in the cache
This is called a cache lookup.
Each cache line must have:
- A valid bit (1 if the line has data, 0 if the line is empty); we also say the line is valid or invalid
- A tag to identify which block is in the line (if the line is valid)
Q2: Block identification
- A tag is kept on each block; there is no need to check the index or block offset bits
- Increasing associativity shrinks the index and expands the tag
Block address = [ Tag | Index ]; full address = [ Tag | Index | Block offset ]
Address breakdown
- Block offset: byte address within the block
  # block offset bits = log2(block size)
- Index: line (or set) number within the cache
  # index bits = log2(# of cache lines)
- Tag: the remaining bits
Address layout: [ Tag | Index | Block offset ]
Address breakdown example
Given the following:
- 32-bit address
- 32 KB direct-mapped cache
- 64-byte blocks
What are the sizes of the tag, index, and block offset fields?
- Index = 9 bits, since there are 32 KB / 64 B = 2^9 lines
- Block offset = 6 bits, since each block holds 64 B = 2^6 bytes
- Tag = 32 - 9 - 6 = 17 bits
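The same arithmetic as a small Python sketch (function name and layout are ours, for illustration):

```python
from math import log2

def address_fields(addr_bits, cache_bytes, block_bytes):
    """Field widths for a direct-mapped cache."""
    offset_bits = int(log2(block_bytes))                 # byte within block
    index_bits = int(log2(cache_bytes // block_bytes))   # which line
    tag_bits = addr_bits - index_bits - offset_bits      # remaining bits
    return tag_bits, index_bits, offset_bits

print(address_fields(32, 32 * 1024, 64))  # -> (17, 9, 6)
```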
Q3: Block replacement
When we need to evict a line, which do we choose?
- The choice is trivial for direct-mapped; what about set-associative or fully associative?
- We want to evict the data least likely to be used next
- Temporal locality suggests that is the line accessed farthest in the past: least recently used (LRU)
  - LRU is hard to implement exactly in hardware, so it is often approximated
- Alternatives: random (a randomly selected line) or FIFO (the line that has been in the cache the longest)
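To make the LRU policy concrete, here is a minimal Python sketch of exact LRU bookkeeping for one set (an illustration of the policy, not of a hardware implementation; all names are ours):

```python
from collections import OrderedDict

class LRUSet:
    """Tracks the blocks in one set; evicts the least recently used."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # keys: tags, ordered oldest -> newest

    def access(self, tag):
        if tag in self.blocks:               # hit: mark as most recent
            self.blocks.move_to_end(tag)
            return None
        victim = None
        if len(self.blocks) == self.ways:    # set full: evict the LRU tag
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return victim                        # tag evicted, or None

s = LRUSet(ways=2)
for t in [0, 1, 0, 2]:
    print(t, "evicts", s.access(t))   # accessing 2 evicts 1, not 0
```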
Q4: What happens on a write?
- Write-through: data written to the cache block is also written to lower-level memory
- Write-back: data is written only to the cache; the lower level is updated when the block falls out of the cache
Comparison (write-through vs. write-back):
- Debugging: easy vs. hard
- Do read misses produce writes? No vs. yes
- Do repeated writes make it to the lower level? Yes vs. no
Write Policy
Do we allocate cache lines on a write?
- Write-allocate: a write miss brings the block into the cache
- No-write-allocate: a write miss leaves the cache as it was
Do we update memory on writes?
- Write-through: memory is immediately updated on each write
- Write-back: memory is updated only when the line is replaced
Write Buffers for Write-Through Caches
[Figure: processor writes go to the cache and into a write buffer, which drains to lower-level memory.]
A write buffer holds data awaiting write-through to lower-level memory.
- Q: Why a write buffer? A: So the CPU doesn't stall on writes.
- Q: Why a buffer, why not just one register? A: Bursts of writes are common.
Write-Back Caches
Need a dirty bit for each line:
- A dirty line has more recent data than memory
- A line starts clean (not dirty) and becomes dirty on the first write to it
- Memory is not updated yet; the cache holds the only up-to-date copy of a dirty line's data
Replacing a dirty line: its data must be written back to memory (hence "write-back").
Basic cache design
Cache memory can hold a copy of data from any part of main memory:
- Tag: the memory address
- Block: the actual data
On each access, compare the address with the tag:
- If they match → hit! Get the data from the cache block.
- If they don't → miss. Get the data from main memory.
Cache organization
- A cache consists of multiple tag/block pairs, called cache lines (or blocks)
  - Lines can be searched in parallel (within reason)
  - Each line also has a valid bit
  - Write-back caches additionally have a dirty bit
- Block sizes can vary
  - Most systems use between 32 and 128 bytes
  - Larger blocks exploit spatial locality
  - Larger block size → smaller tag size
Direct-mapped cache example
Assume the following simple setup:
- Only 2 levels in the hierarchy
- 16-byte memory, so 4-bit addresses
- Cache organization: direct-mapped, 8 total bytes with 2 bytes per block (4 lines), write-back
This leads to the following address breakdown:
- Offset: 1 bit
- Index: 2 bits
- Tag: 1 bit
Direct-mapped cache example: initial state
Instruction sequence:
  lb $t0, 1($zero)
  lb $t1, 8($zero)
  sb $t1, 4($zero)
  sb $t0, 13($zero)
  lb $t1, 9($zero)
Memory (address: value; two-byte blocks numbered 0-7, block # = address / 2):
   0: 78    1: 29    2: 120   3: 123
   4: 71    5: 150   6: 162   7: 173
   8: 18    9: 21   10: 33   11: 28
  12: 19   13: 200  14: 210  15: 225
Cache (one row per line: V D Tag | Data):
  line 00: 0 0 0 | 0 0
  line 01: 0 0 0 | 0 0
  line 10: 0 0 0 | 0 0
  line 11: 0 0 0 | 0 0
Registers: $t0 = ?, $t1 = ?
Hits: 0, Misses: 0
Direct-mapped cache example: access #1
lb $t0, 1($zero)
Address = 1 = 0001₂ → Tag = 0, Index = 00, Offset = 1
Line 00 is invalid → miss.
Block 0 (memory[0-1] = 78, 29) is brought into line 00:
  line 00: 1 0 0 | 78 29
$t0 = memory[1] = 29
Hits: 0, Misses: 1; Registers: $t0 = 29, $t1 = ?
Direct-mapped cache example: access #2
lb $t1, 8($zero)
Address = 8 = 1000₂ → Tag = 1, Index = 00, Offset = 0
Line 00 is valid but its tag (0) does not match → miss. The line is clean, so nothing is written back.
Block 4 (memory[8-9] = 18, 21) replaces the old block in line 00:
  line 00: 1 0 1 | 18 21
$t1 = memory[8] = 18
Hits: 0, Misses: 2; Registers: $t0 = 29, $t1 = 18
Direct-mapped cache example: access #3
sb $t1, 4($zero)
Address = 4 = 0100₂ → Tag = 0, Index = 10, Offset = 0
Line 10 is invalid → write miss. With write-allocate, the block is first brought into the cache.
Block 2 (memory[4-5] = 71, 150) is loaded into line 10, then the store writes $t1 = 18 at offset 0 and sets the dirty bit:
  line 10: 1 1 0 | 18 150
Hits: 0, Misses: 3; Registers: $t0 = 29, $t1 = 18
Direct-mapped cache example: access #4
sb $t0, 13($zero)
Address = 13 = 1101₂ → Tag = 1, Index = 10, Offset = 1
Line 10 is valid but its tag (0) does not match → miss, and the line is dirty.
Must write back the dirty block before replacing it: memory[4-5] ← 18, 150, so memory[4] changes from 71 to 18.
Hits: 0, Misses: 4
Block 6 (memory[12-13] = 19, 200) is then loaded into line 10, and the store writes $t0 = 29 at offset 1, setting the dirty bit again:
  line 10: 1 1 1 | 19 29
Hits: 0, Misses: 4; Registers: $t0 = 29, $t1 = 18
Direct-mapped cache example: access #5
lb $t1, 9($zero)
Address = 9 = 1001₂ → Tag = 1, Index = 00, Offset = 1
Line 00 is valid and its tag (1) matches → hit!
$t1 = byte at offset 1 of line 00 = 21
Final tally: Hits: 1, Misses: 4; Registers: $t0 = 29, $t1 = 21
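The whole trace can be reproduced in a few lines. Below is a minimal Python sketch of this direct-mapped, write-back, write-allocate cache (structure and names are our own, not from the slides); running it ends with 1 hit, 4 misses, and memory[4] = 18, matching the slides:

```python
BLOCK = 2                      # bytes per block
LINES = 4                      # cache lines

memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]
# Each line: valid bit, dirty bit, tag, and a copy of one block.
cache = [{"valid": False, "dirty": False, "tag": 0, "data": [0] * BLOCK}
         for _ in range(LINES)]
hits = misses = 0

def access(addr, value=None):
    """Load one byte (value is None) or store one byte."""
    global hits, misses
    offset = addr % BLOCK
    index = (addr // BLOCK) % LINES
    tag = addr // (BLOCK * LINES)
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        hits += 1
    else:
        misses += 1
        if line["valid"] and line["dirty"]:      # write back the old block
            base = (line["tag"] * LINES + index) * BLOCK
            memory[base:base + BLOCK] = line["data"]
        base = (addr // BLOCK) * BLOCK           # fetch the new block
        line.update(valid=True, dirty=False, tag=tag,
                    data=memory[base:base + BLOCK])
    if value is None:
        return line["data"][offset]
    line["data"][offset] = value                 # store: mark line dirty
    line["dirty"] = True

t0 = access(1)          # miss: t0 = 29
t1 = access(8)          # miss: t1 = 18
access(4, t1)           # write miss, allocate
access(13, t0)          # miss on a dirty line: write back memory[4-5]
t1 = access(9)          # hit: t1 = 21
print(f"$t0={t0}, $t1={t1}, hits={hits}, misses={misses}, mem[4]={memory[4]}")
```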
Cache performance
Simplified model:
  CPU time = (CPU clock cycles + memory stall cycles) x cycle time
  Memory stall cycles = # of misses x miss penalty
                      = IC x misses/instruction x miss penalty
                      = IC x memory accesses/instruction x miss rate x miss penalty
  Average CPI = CPI(without stalls) + memory accesses/instruction x miss rate x miss penalty
  AMAT = hit time + miss rate x miss penalty
Example
A computer has CPI = 1 when all accesses hit, and loads and stores are 50% of instructions. If the miss penalty is 25 cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
- If all accesses hit: CPU time = (IC x CPI + 0) x CCT = IC x 1.0 x CCT
- Real cache with stalls:
  Memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75
  CPU time = (IC x 1.0 + IC x 0.75) x CCT = 1.75 x IC x CCT
- Speedup = (1.75 x IC x CCT) / (IC x CCT) = 1.75
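A quick Python check of this arithmetic (variable names are ours):

```python
accesses_per_instr = 1 + 0.5   # 1 instruction fetch + 0.5 data accesses
miss_rate, miss_penalty = 0.02, 25
base_cpi = 1.0

stall_cpi = accesses_per_instr * miss_rate * miss_penalty   # 0.75
real_cpi = base_cpi + stall_cpi                             # 1.75
print("speedup if all hits:", real_cpi / base_cpi)          # -> 1.75
```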
Average memory access time
For a unified cache:
  AMAT = (hit time) + (miss rate) x (miss penalty)
For a split cache:
  AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
       + %data x (hit time + data miss rate x miss penalty)
For a multi-level cache:
  AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1)
       = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
Note: miss rate(L2) is measured on the leftovers from the L1 cache (a local miss rate).
Example (split cache vs. unified cache)
Which has the lower miss rate: a 16 KB instruction cache plus a 16 KB data cache, or a 32 KB unified cache? The misses per 1000 instructions for the instruction, data, and unified caches are 3.82, 40.9, and 43.3, respectively. Assume 36% of instructions are data transfer instructions, a hit takes 1 clock cycle, and the miss penalty is 200 cycles. A load or store hit takes 1 extra cycle on a unified cache. What is the AMAT?
Find each miss rate = (misses/instruction) / (memory accesses/instruction):
- Miss rate(I) = (3.82/1000) / 1 = 0.004
- Miss rate(D) = (40.9/1000) / 0.36 = 0.114
- Miss rate(U) = (43.3/1000) / (1 + 0.36) = 0.0318
Of all memory accesses, 1/1.36 ≈ 74% are instruction references and 0.36/1.36 ≈ 26% are data references, so:
- Miss rate(split) = 74% x 0.004 + 26% x 0.114 = 0.0326
The 32 KB unified cache has a slightly lower miss rate.
Example (cont.)
AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
     + %data x (hit time + data miss rate x miss penalty)
AMAT(split) = 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200) = 7.52
AMAT(unified) = 74% x (1 + 0.0318 x 200) + 26% x (1 + 1 + 0.0318 x 200) = 7.62
So the split cache has the lower AMAT despite its slightly higher miss rate, because it avoids the extra cycle on load/store hits.
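A quick Python check of both AMATs, using the slide's rounded percentages (a sketch; names are ours):

```python
# Split vs. unified AMAT; 74%/26% instruction/data access split as above.
penalty = 200
split = 0.74 * (1 + 0.004 * penalty) + 0.26 * (1 + 0.114 * penalty)
unified = 0.74 * (1 + 0.0318 * penalty) + 0.26 * (1 + 1 + 0.0318 * penalty)
print(round(split, 2), round(unified, 2))   # -> 7.52 7.62
```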
Another example (multilevel cache)
Suppose that per 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the miss rates? Assume the miss penalty from the L2 cache to memory is 200 cycles, the hit time of the L2 cache is 10 cycles, and the hit time of L1 is 1 cycle. What is the AMAT?
- Miss rate(L1) = 40/1000 = 4%
- Miss rate(L2) = 20/40 = 50% (local: measured on L1's leftovers)
AMAT = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
     = 1 + 4% x (10 + 50% x 200) = 5.4 cycles
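The same formula as a small Python helper (a sketch; names are ours):

```python
def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2_local, penalty_l2):
    """Two-level AMAT; mr_l2_local is measured on L1's leftovers."""
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2_local * penalty_l2)

print(amat_two_level(1, 0.04, 10, 0.50, 200))   # -> 5.4
```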
Reasons for cache misses
AMAT = (hit time) + (miss rate) x (miss penalty); reducing misses improves performance.
The three C's:
- Compulsory miss: the first reference to an address. Remedy: increase the block size.
- Capacity miss: the cache is too small to hold the data. Remedy: increase the cache size.
- Conflict miss: a block replaced from a busy line or set; it would have been a hit in a fully associative cache. Remedy: increase associativity.
Six Basic Cache Optimizations
Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
5. Giving read misses priority over writes (e.g., a read completes before earlier writes still sitting in the write buffer)
Reducing hit time:
6. Avoiding address translation during cache indexing
Problems with memory
- DRAM is too expensive to buy many gigabytes
- We need our programs to work even if they require more memory than we have
  (a program that works on a machine with 512 MB should still work on a machine with 256 MB)
- Most systems run multiple programs
Solutions
- Leave the problem to the programmer: assume the programmer knows the exact configuration
- Overlays: the compiler identifies mutually exclusive regions
- Virtual memory: use hardware and software to automatically translate references from virtual addresses (what the programmer sees) to physical addresses (indexes into DRAM or disk)
Benefits of virtual memory
[Figure: the CPU issues virtual addresses; address translation hardware turns them into physical addresses before they reach memory.]
- User programs run in a standardized virtual address space
- Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
- The hardware supports "modern" OS features: protection, translation, and sharing
Managing virtual memory
- Effectively treat main memory as a cache for disk
  - Blocks are called pages
  - Misses are called page faults
- A virtual address consists of a virtual page number and a page offset:
  [ Virtual page number (bits 31-12) | Page offset (bits 11-0) ]
Page tables encode virtual address spaces
[Figure: a virtual address space and a physical memory space, with virtual pages mapped to physical frames through the page table.]
- A virtual address space is divided into blocks of memory called pages
- A machine usually supports pages of a few sizes (e.g., MIPS R4000)
- A valid page table entry codes the physical memory "frame" address for the page
- The page table is indexed by the virtual address and maps virtual page numbers to physical frames ("PTE" = page table entry)
- Virtual memory => treat main memory as a cache for disk
Details of the page table
[Figure: a virtual address (virtual page number, 12-bit offset) indexes the page table, yielding a physical address (physical page number, same 12-bit offset).]
- The page table is located in physical memory; a page table base register points to its start
- The virtual page number indexes into the page table
- Each entry holds a valid bit, access rights, and the physical frame address (PA)
- The physical address is the physical page number concatenated with the unchanged page offset
Paging the page table
- A table for 4 KB pages in a 32-bit address space has 1M entries
- Each process needs its own address space!
Two-level page tables:
- 32-bit virtual address: [ P1 index (bits 31-22) | P2 index (bits 21-12) | Page offset (bits 11-0) ]
- The top-level table is wired (always resident) in main memory
- Only a subset of the 1024 second-level tables is in main memory; the rest are on disk or unallocated
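A minimal Python sketch of a two-level lookup under these assumptions (dictionaries stand in for the in-memory tables, and a missing second-level table models "on disk or unallocated"; all names are ours):

```python
PAGE_BITS = 12                      # 4 KB pages
LEVEL_BITS = 10                     # 1024 entries per table

def translate(vaddr, top_table):
    """Walk a two-level page table; return a physical address or None."""
    p1 = vaddr >> (PAGE_BITS + LEVEL_BITS)                  # bits 31-22
    p2 = (vaddr >> PAGE_BITS) & ((1 << LEVEL_BITS) - 1)     # bits 21-12
    offset = vaddr & ((1 << PAGE_BITS) - 1)                 # bits 11-0
    second = top_table.get(p1)
    if second is None or p2 not in second:
        return None                                         # page fault
    return (second[p2] << PAGE_BITS) | offset               # frame || offset

top = {0: {3: 0x2A}}                # virtual page (0, 3) -> frame 0x2A
print(hex(translate(0x3ABC, top)))  # -> 0x2aabc
```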
VM and disk: page replacement policy
[Figure: a clock over the set of all pages in memory, with head and tail pointers sweeping the used bits; freed pages go on the free list.]
- Each page table entry has a used bit (set to 1 on any reference) and a dirty bit (set when the page is written)
- Head pointer: clears the used bit in the page table
- Tail pointer: if the used bit is still clear when it arrives, place the page on the free list; schedule pages with the dirty bit set to be written to disk
- The free list holds free pages ready for reuse
- Architect's role: support setting the dirty and used bits
Virtual memory performance
- Address translation requires a physical memory access to read the page table
- We must then access physical memory again to actually get the data
- So each load performs at least 2 memory reads, and each store performs at least 1 memory read followed by a write
Improving virtual memory performance
Use a cache for common translations: the translation lookaside buffer (TLB).
[Figure: each TLB entry holds a valid bit, a tag (the virtual page number), and a physical page number; the page offset bypasses translation.]
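A minimal Python sketch of a TLB in front of the page-table walk (a dictionary stands in for a fully associative TLB with no capacity limit; all names are ours):

```python
PAGE_BITS = 12
tlb = {}                            # virtual page number -> physical frame

def lookup(vaddr, page_table):
    """A TLB hit avoids the page-table walk; a miss fills the TLB."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = tlb.get(vpn)
    if frame is None:               # TLB miss: walk the page table
        frame = page_table[vpn]     # (would fault if unmapped)
        tlb[vpn] = frame            # cache the translation
    return (frame << PAGE_BITS) | offset

pt = {3: 0x2A}
print(hex(lookup(0x3ABC, pt)))      # miss, fills the TLB -> 0x2aabc
print(hex(lookup(0x3F00, pt)))      # TLB hit -> 0x2af00
```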
Caches and virtual memory
We now have two different addresses, virtual and physical. Which should we use to access the cache?
- Physical address: simpler to manage, but slower access (the address must be translated first)
- Virtual address: faster access, but aliasing problems and difficult management
- Use both: virtually indexed, physically tagged
Three Advantages of Virtual Memory
Translation:
- A program can be given a consistent view of memory, even though physical memory is scrambled
- Makes multithreading reasonable (now used a lot!)
- Only the most important part of a program (the "working set") must be in physical memory
- Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
Protection:
- Different threads (or processes) are protected from each other
- Different pages can be given special behavior (read-only, invisible to user programs, etc.)
- Kernel data is protected from user programs
- Very important for protection against malicious programs
Sharing:
- The same physical page can be mapped to multiple users ("shared memory")
- Programs can share physical memory without knowing what else is there
- Makes memory appear larger than it actually is
Average memory access time
AMAT = (hit time) + (miss rate) x (miss penalty)
Given the following:
- Cache: 1-cycle access time
- Memory: 100-cycle access time
- Disk: 10,000-cycle access time
What is the average memory access time if the cache hit rate is 90% and the memory hit rate is 80%?
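The slide leaves this as an exercise; applying the multilevel formula from earlier, one way to work it out in Python (our working, not part of the slides):

```python
# The memory's miss penalty is the disk access time; the cache's miss
# penalty is the average time to service a request from memory and below.
memory_amat = 100 + 0.20 * 10_000    # memory misses 20% of the time
amat = 1 + 0.10 * memory_amat        # cache misses 10% of the time
print(amat)                          # -> 211.0 cycles
```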