CS1104 – Computer Organization
PART 2: Computer Architecture
Lecture 10
Memory Hierarchy
Topics
- Memories: types
- Memory hierarchy: why?
- Basics of caches
- Measuring cache performance
- Improving cache performance
- Framework for memory hierarchies
Memories: Review

SRAM:
- value is stored on a pair of inverting gates
- very fast, but takes up more space than DRAM (4 to 6 transistors)
- access time: 5-25 ns
- cost (US$) per MByte in 1997: 100 to 250

DRAM:
- value is stored as a charge on a capacitor (must be refreshed)
- very small, but slower than SRAM (by a factor of 5 to 10)
- access time: 60-120 ns
- cost (US$) per MByte in 1997: 5 to 10
Memory Hierarchy: why?

Users want large and fast memories!
- SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte.
- DRAM access times are 60-120 ns at a cost of $5 to $10 per MByte.
- Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte.

Try to give it to them anyway: build a memory hierarchy.

[Figure: the memory hierarchy as a pyramid from the CPU through Level 1, Level 2, ..., Level n; size grows and speed drops moving away from the CPU.]
Memory Hierarchy: requirements

If a level is closer to the processor, it must:
- be smaller
- be faster
- contain a subset (the most recently used data) of the levels beneath it
- contain all the data held in the levels above it

The lowest level (usually disk or main memory) contains all the available data.
Locality

The principle that makes having a memory hierarchy a good idea. If an item is referenced:
- temporal locality: it will tend to be referenced again soon
- spatial locality: nearby items will tend to be referenced soon

Our initial focus: two levels (upper, lower)
- block: the minimum unit of data
- hit: the data requested is in the upper level
- miss: the data requested is not in the upper level
Caches: exploiting locality
Cache

Two issues:
- How do we know if a data item is in the cache?
- If it is, how do we find it?

Our first example: block size is one word of data, "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be (i.e., lots of items at the lower level share locations in the upper level).
Direct Mapped Cache

Cache location 0 can be occupied by data from memory locations 0, 4, 8, ... In general: any memory location that is a multiple of 4.

[Figure: a 4-byte direct mapped cache (indices 0-3) and memory addresses 0-F; each memory address maps to cache index (address modulo 4).]
Direct Mapped Cache

Mapping: the address is taken modulo the number of blocks in the cache.

[Figure: an 8-block cache (indices 000-111) and memory; addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 map to cache indices 001 and 101 (address modulo 8).]
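The mapping rule can be checked with a short sketch (plain Python, not from the slides): the cache index is the memory address modulo the number of cache blocks, which for a power-of-two cache is just the low bits of the address.

```python
# Direct-mapped placement: index = address mod number of blocks.
# For an 8-block cache, the index is the low 3 bits of the address.
NUM_BLOCKS = 8

for addr in [0b00001, 0b00101, 0b01001, 0b01101,
             0b10001, 0b10101, 0b11001, 0b11101]:
    index = addr % NUM_BLOCKS
    print(f"{addr:05b} -> index {index:03b}")
```

Every address in the list ends in 001 or 101, so only cache indices 001 and 101 are ever used, exactly as in the figure.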
Issues with Direct Mapped Caches

1. Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
2. What if we have a block size > 1 byte?

Solution: divide the memory address into three fields:

    ttttttttttttttttt iiiiiiiiii oooo
    tag               index      byte offset
    (to check if we   (to select (within the
    have the correct  the block) block)
    block)
Direct Mapped Caches: Terminology

All fields are read as unsigned integers.
- Index: specifies the cache index (which "row" of the cache we should look in)
- Offset: once we've found the correct block, specifies which byte within the block we want
- Tag: the remaining bits after the offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
Direct Mapped Cache: Example

Suppose we have a 16 KB direct-mapped cache with 4-word blocks. Determine the size of the tag, index and offset fields if we're using a 32-bit architecture.

Offset:
- we need to specify the correct byte within a block
- a block contains 4 words = 16 bytes = 2^4 bytes
- so we need 4 bits to specify the correct byte
Index:
- we need to specify the correct row in the cache
- the cache contains 16 KB = 2^14 bytes
- a block contains 2^4 bytes (4 words)
- # rows/cache = # blocks/cache (since there is one block per row)
  = (bytes/cache) / (bytes/row)
  = (2^14 bytes/cache) / (2^4 bytes/row)
  = 2^10 rows/cache
- so we need 10 bits to specify this many rows
Tag:
- use the remaining bits as the tag
- tag length = memory address length - offset - index = 32 - 4 - 10 = 18 bits
- so the tag is the leftmost 18 bits of the memory address
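The three fields can be extracted with a few shifts and masks. The sketch below (illustrative Python, not part of the slides; the function name split_address is made up) hard-codes the widths just derived:

```python
# Field widths for a 16 KB direct-mapped cache with 4-word (16-byte)
# blocks on a 32-bit architecture, as derived above.
OFFSET_BITS = 4                            # 16 bytes per block
INDEX_BITS = 10                            # 2^14 / 2^4 = 2^10 rows
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS   # 18

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x00008014))  # -> (2, 1, 4)
```

The printed result matches the later worked breakdown of 0x00008014: tag 0b10 = 2, index 0b0000000001 = 1, offset 0b0100 = 4.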
Direct Mapped Cache for MIPS

[Figure: a MIPS direct-mapped cache with one-word blocks and 1024 entries. Address bit positions 31-0 are split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of entries 0-1023, each holding a valid bit, a 20-bit tag and 32 bits of data; Hit is asserted when the entry is valid and its stored tag matches the address tag, and Data delivers the stored word.]
#Bits required (example)

- 32-bit byte addresses
- a direct-mapped cache of size 2^n words with one-word (4-byte) blocks

What is the size of the tag field?
- 32 - (n + 2) bits (2 bits for the byte offset and n bits for the index)

What is the total number of bits in the cache?
- 2^n x (block size + tag size + valid field size)
  = 2^n x (32 + (32 - n - 2) + 1)   (because the block size is 32 bits)
  = 2^n x (63 - n)
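The formula is easy to sanity-check numerically; a small sketch (Python, the function name is mine):

```python
def total_cache_bits(n):
    """Total bits in a 2^n-word direct-mapped cache with one-word
    blocks and 32-bit byte addresses: 2^n * (63 - n)."""
    data_bits = 32                 # one-word block
    tag_bits = 32 - (n + 2)        # 2 bits byte offset, n bits index
    valid_bits = 1
    return (1 << n) * (data_bits + tag_bits + valid_bits)

# e.g. n = 10: a 1K-word (4 KB of data) cache needs 2^10 * 53 bits
print(total_cache_bits(10))  # -> 54272
```

Note that the overhead (tag + valid bits) is a sizeable fraction of the data bits for small block sizes, which is one motivation for larger blocks.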
Direct Mapped Cache: taking advantage of spatial locality

[Figure: a direct-mapped cache with 4K entries and four-word (128-bit) blocks. Address bit positions 31-0 are split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 16-bit tag and 128 bits of data; a multiplexor uses the block offset to select the requested 32-bit word, and Hit is asserted on a valid tag match.]
Accessing data in a direct mapped cache

Example: a 16 KB, direct-mapped cache with 4-word blocks. Read 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014. Memory values are shown below (only the cache/memory level of the hierarchy):

Address (hex)   Value of word
0x00000010      a
0x00000014      b
0x00000018      c
0x0000001C      d
...             ...
0x00000030      e
0x00000034      f
0x00000038      g
0x0000003C      h
...             ...
0x00008010      i
0x00008014      j
0x00008018      k
0x0000801C      l
...             ...
The 4 addresses divided (for convenience) into tag, index and byte offset fields:

Tag                 Index       Offset
000000000000000000  0000000001  0100   (0x00000014)
000000000000000000  0000000001  1100   (0x0000001C)
000000000000000000  0000000011  0100   (0x00000034)
000000000000000010  0000000001  0100   (0x00008014)
Hits vs. Misses

Read hits:
- this is what we want!

Read misses:
- stall the CPU, fetch the block from memory, deliver it to the cache, restart the load instruction

Write hits:
- replace the data in both the cache and memory (write-through)
- write the data only into the cache, and write the block back to memory later (write-back)

Write misses:
- read the entire block into the cache, then write the word (allocate on write miss)
- do not read the cache line; just write to memory (no allocate on write miss)
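Replaying the four example addresses from the earlier slides through a tags-only model shows which reads hit. This is an illustrative sketch (Python; the function name is mine), assuming the 16 KB / 4-word-block cache above and a cold cache:

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # 16-byte blocks, 2^10 rows

def run_accesses(addresses):
    """Return 'hit'/'miss' per access for a direct-mapped cache."""
    cache = {}                    # index -> tag currently stored there
    results = []
    for addr in addresses:
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        if cache.get(index) == tag:
            results.append("hit")
        else:
            results.append("miss")
            cache[index] = tag    # fetch the block on a miss
    return results

print(run_accesses([0x00000014, 0x0000001C, 0x00000034, 0x00008014]))
# -> ['miss', 'hit', 'miss', 'miss']
```

The last access misses even though index 1 is occupied: 0x00000014 and 0x00008014 share index 1 but differ in tag, so the new block evicts the old one.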
Hardware Issues

Make reading multiple words easier by using banks of memory. It can get a lot more complicated...

[Figure: three organizations of the CPU-cache-bus-memory path: a. one-word-wide memory organization; b. wide memory organization, with a multiplexor between the cache and the wide memory; c. interleaved memory organization with four memory banks (bank 0 to bank 3).]
Performance

Increasing the block size tends to decrease the miss rate.

[Figure: miss rate (0% to 40%) versus block size in bytes, with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]
Performance (cont.)

Use split caches because there is more spatial locality in code:

Program  Block size (words)  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                   6.1%                   2.1%            5.4%
gcc      4                   2.0%                   1.7%            1.9%
spice    1                   1.2%                   1.3%            1.2%
spice    4                   0.3%                   0.6%            0.4%
Memory access times

[Figure: the same three memory organizations as under Hardware Issues: a. one-word-wide, b. wide, c. interleaved with four banks.]

Assume:
- 1 clock cycle to send the address
- 15 clock cycles to initiate each DRAM access
- 1 clock cycle to transfer a word of data

Clock cycles required to access 4 words:
- a. one-word-wide memory: 1 + 4x15 + 4x1 = 65
- b. wide (four-word) memory: 1 + 1x15 + 1 = 17
- c. interleaved memory (4 banks): 1 + 1x15 + 4x1 = 20
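The three totals follow from one timing model (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word moved over the bus); a quick arithmetic check in Python (function name is mine):

```python
def access_cycles(dram_accesses, bus_transfers):
    """Cycles to read a 4-word block: 1 (send address)
    + 15 per DRAM access + 1 per bus transfer."""
    return 1 + 15 * dram_accesses + 1 * bus_transfers

print(access_cycles(4, 4))  # one-word-wide memory: 65
print(access_cycles(1, 1))  # wide (4-word) memory: 17
print(access_cycles(1, 4))  # interleaved, 4 banks: 20
```

Interleaving gets most of the benefit of a wide memory (one DRAM latency instead of four) while keeping the cheap one-word bus.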
Improving performance

Two ways of improving performance:
- decreasing the miss ratio: associativity
- decreasing the miss penalty: multilevel caches
Decreasing miss ratio with associativity

[Figure: four organizations of an eight-block cache. One-way set associative (direct mapped): 8 blocks (0-7), one tag/data pair per block. Two-way set associative: 4 sets (0-3) with 2 blocks per set. Four-way set associative: 2 sets (0-1) with 4 blocks per set. Eight-way set associative (fully associative): one set of 8 blocks.]
4-way set-associative cache

[Figure: a 4-way set-associative cache with 256 sets (0-255). Address bit positions 31-0 are split into a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2). The index selects a set; the address tag is compared in parallel against the four tags of that set (each entry holding a valid bit, a tag and 32 bits of data), and a 4-to-1 multiplexor selects the data of the matching way. Hit is asserted on a valid match.]
Tag size versus associativity

Consider a cache of 4K blocks, a four-word block size (four-word cache lines) and 32-bit addresses.

Direct mapped:
- byte offset = 4 bits (each block = 4 words = 16 bytes)
- index + tag = 32 - 4 = 28 bits
- for 4K blocks, 12 index bits are required
- #tag bits for each block = 28 - 12 = 16
- total #tag bits = 16 x 4K = 64 Kbits

4-way set-associative:
- #sets = 1K, so 10 index bits are required
- #tag bits for each block = 28 - 10 = 18
- total #tag bits = 4 x 18 x 1K = 72 Kbits
Block replacement policy

In a direct-mapped cache, when a miss occurs, the requested block can go in only one position. In a set-associative cache, there are multiple positions in a set where a block can be stored. If all the positions are filled, which block should be replaced?
- Least Recently Used (LRU) policy: replace the block that has gone unused the longest
- Random: choose a block at random and replace it
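An LRU set-associative cache can be modelled by keeping each set's tags in most-recently-used order. The sketch below (Python; the names and the tiny 2-set, 2-way geometry are mine, for illustration only) evicts from the front of the list:

```python
OFFSET_BITS = 4                            # 16-byte blocks, as before

def lru_accesses(addresses, num_sets=2, ways=2):
    """Return 'hit'/'miss' per access for an LRU set-associative cache."""
    sets = [[] for _ in range(num_sets)]   # each list: LRU ... MRU tags
    results = []
    for addr in addresses:
        block = addr >> OFFSET_BITS
        index, tag = block % num_sets, block // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)               # refresh: move to the MRU end
            tags.append(tag)
            results.append("hit")
        else:
            if len(tags) == ways:
                tags.pop(0)                # evict the LRU block
            tags.append(tag)
            results.append("miss")
    return results

# Blocks 0x00, 0x40, 0x80 all map to set 0; LRU keeps the re-used block.
print(lru_accesses([0x00, 0x40, 0x00, 0x80, 0x40]))
# -> ['miss', 'hit' appears only for the re-referenced block]
```

The third access hits because block 0x00 was touched again before the set filled; when 0x80 arrives the set is full, and LRU evicts 0x40 (the least recently used), so the final access to 0x40 misses.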
Common framework for memory hierarchies

Q1: where can a block be placed? (note: a block is in this case a cache line)
- direct mapped cache: one position
- n-way set-associative cache: n positions (typically 2-8)
- fully associative: anywhere

Q2: how is a block found?
- direct mapped: the index part of the address indicates the entry
- n-way: use the index to search all n cache blocks of the set
- fully associative: check all tags
Q3: which block should be replaced on a miss?
- direct mapped: no choice
- associative caches: use a replacement algorithm, such as LRU

Q4: what happens on a write?
- write-through or write-back
- on a write miss: allocate or no-allocate
Understanding (cache) misses: the three Cs
- compulsory miss: the first access to a block, which cannot yet be in the cache
- capacity miss: the cache cannot contain all the blocks the program needs
- conflict miss: multiple blocks compete for the same position (cannot occur in a fully associative cache)

[Figure: miss rate versus cache size for direct mapped (1-way), 2-way, 4-way and fully associative caches; the fully associative curve separates the compulsory and capacity components of the miss rate.]
Reading

Recommended:
- 3rd edition of the textbook: Chapter 7, Sections 7.1-7.3 and Section 7.5
- 2nd edition of the textbook: Chapter 7, Sections 7.1-7.3 and Section 7.5