
Chapter 4: Cache Memory


Table 4.1 Key Characteristics of Computer Memory Systems

Location
  Internal (e.g. processor registers, cache, main memory)
  External (e.g. optical disks, magnetic disks, tapes)
Capacity
  Number of words
  Number of bytes
Unit of Transfer
  Word
  Block
Access Method
  Sequential
  Direct
  Random
  Associative
Performance
  Access time
  Cycle time
  Transfer rate
Physical Type
  Semiconductor
  Magnetic
  Optical
  Magneto-optical
Physical Characteristics
  Volatile/nonvolatile
  Erasable/nonerasable
Organization
  Memory modules

Characteristics of Memory Systems

Location

Refers to whether memory is internal or external to the computer

Internal memory is often equated with main memory

Processor requires its own local memory, in the form of registers

Cache is another form of internal memory

External memory consists of peripheral storage devices that are accessible to the processor via I/O controllers

Capacity

Memory is typically expressed in terms of bytes

Unit of transfer

For internal memory the unit of transfer is equal to the number of electrical lines into and out of the memory module


Location of memory

Processor requires its own local memory (registers)

Internal memory: the main memory

External memory: memory on peripheral storage devices (disk, tape, etc.)

Memory Capacity

Memory capacity of internal memory is typically expressed in terms of bytes or words
  Common word lengths are 8, 16, and 32 bits

External memory capacity is typically expressed in terms of bytes

Unit of Transfer of Memory

For internal memory,
  The unit of transfer is equal to the word length, but is often larger, such as 64, 128, or 256 bits
  Usually governed by data bus width

For external memory,
  The unit of transfer is usually a block, which is much larger than a word

The addressable unit is the smallest location which can be uniquely addressed
  A word, internally

Access Methods (1)

Sequential access

Start at the beginning and read through in order

Access time depends on location of data and previous location

e.g. tape

Direct access

Individual blocks have unique address

Access is by jumping to vicinity plus sequential search

Access time depends on location and previous location

e.g. disk

Access Methods (2)

Random access

Individual addresses identify locations exactly

Access time is independent of location or previous access

e.g. RAM

Associative

Data is located by a comparison with the contents of a portion of the store

Access time is independent of location or previous access

e.g. cache

Access Methods (3): Summary

Sequential access
  Memory is organized into units of data called records
  Access must be made in a specific linear sequence
  Access time is variable

Direct access
  Involves a shared read-write mechanism
  Individual blocks or records have a unique address based on physical location
  Access time is variable

Random access
  Each addressable location in memory has a unique, physically wired-in addressing mechanism
  The time to access a given location is independent of the sequence of prior accesses and is constant
  Any location can be selected at random and directly addressed and accessed
  Main memory and some cache systems are random access

Associative
  A word is retrieved based on a portion of its contents rather than its address
  Each location has its own addressing mechanism, and retrieval time is constant, independent of location or prior access patterns
  Cache memories may employ associative access

Performance

Access time (latency)
  For random-access memory, this is the time it takes to perform a read or write operation
    That is, the time between presenting the address and getting the valid data
  For non-random-access memory, access time is the time it takes to position the read-write mechanism at the desired location

Memory cycle time
  Memory cycle time = access time + any additional time required before a second access can commence
  Additional time?
    Time for transient signals to die out on signal lines
    Time to regenerate data if they are read destructively
  Memory cycle time >= access time

Performance

Transfer Rate (Bandwidth)

Rate at which data can be transferred into or out of a memory unit

For random access memory,

Transfer rate = (1/cycle time)

For non-random access memory, the following relationship holds:

  T_N = T_A + N/R

  T_N : average time to read or write N bits
  T_A : average access time
  N : number of bits
  R : transfer rate (bps)
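
As a quick worked illustration, here is a minimal C sketch of this relationship; the timing values are assumptions chosen for the example, not figures from the text:

```c
#include <stdio.h>

/* Average time to read or write N bits from a non-random-access device:
 *   T_N = T_A + N/R
 * The values below are illustrative assumptions. */
int main(void)
{
    double ta = 5e-3;        /* average access (positioning) time: 5 ms */
    double r  = 1e9;         /* transfer rate: 1 Gbit/s                 */
    double n  = 8.0 * 4096;  /* one 4-KiB block, expressed in bits      */

    double tn = ta + n / r;
    printf("T_N = %.6f s\n", tn);  /* dominated by T_A for small transfers */
    return 0;
}
```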

Physical Types

Semiconductor

RAM

Magnetic

Disk & Tape

Optical

CD & DVD

Others

Bubble memory (a failed technology of the 1980s)

Holographic data storage (an emerging technology)

Tradeoffs in Memory Performance

For memory,

Faster access time, greater cost per bit

Greater capacity, smaller cost per bit

Greater capacity, slower access time

The system designer may find it impossible to meet all of these goals optimally at once

The way to solve this problem is "the memory hierarchy"

Memory

The most common forms are:

Semiconductor memory

Magnetic surface memory

Optical

Magneto-optical

Several physical characteristics of data storage are important:

Volatile memory

Information decays naturally or is lost when electrical power is switched off

Nonvolatile memory

Once recorded, information remains without deterioration until deliberately changed

No electrical power is needed to retain information

Magnetic-surface memories

Are nonvolatile

Semiconductor memory

May be either volatile or nonvolatile

Nonerasable memory

Cannot be altered, except by destroying the storage unit

Semiconductor memory of this type is known as read-only memory (ROM)

For random-access memory the organization is a key design issue

Organization refers to the physical arrangement of bits to form words


Memory Hierarchy

Design constraints on a computer's memory can be summed up by three questions:
  How much, how fast, how expensive

There is a trade-off among capacity, access time, and cost

Faster access time, greater cost per bit

Greater capacity, smaller cost per bit

Greater capacity, slower access time

The way out of the memory dilemma is not to rely on a single memory component or technology, but to employ a memory hierarchy

Memory Hierarchy

Registers

In CPU

Internal or Main memory

May include one or more levels of cache

“RAM”

External memory

Backing store

Memory Hierarchy - Diagram

Going down the hierarchy:
  - Decreasing cost per bit
  - Increasing capacity
  - Increasing access time
  - Decreasing frequency of access of the memory by the processor

Hierarchy List

Registers

L1 Cache

L2 Cache

Main memory

Disk cache

Disk

Optical

Tape

So, do you want a fast memory architecture?

It is possible to build a computer which uses only static RAM (see later)
  This would be very fast
  This would need no cache
  This would cost a very large amount

Locality of Reference

During the course of the execution of a program, memory references tend to cluster (group!)

e.g. loops

Cache

Small amount of fast memory

Sits between normal main memory and CPU

May be located on CPU chip or module

Figure 4.2 Performance of a Simple Two-Level Memory
(The figure plots average access time against the fraction of accesses involving only level 1, i.e., the hit ratio H. Average access time falls from T1 + T2 at H = 0 to T1 at H = 1.)
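
The curve in Figure 4.2 follows from the expected-value formula for a two-level memory, T_avg = H*T1 + (1 - H)*(T1 + T2). A minimal C sketch of that formula, with illustrative timings of my own choosing:

```c
#include <stdio.h>

/* Average access time of a simple two-level memory (Figure 4.2):
 *   T_avg = H*T1 + (1 - H)*(T1 + T2)
 * where H is the level-1 hit ratio. T1 and T2 are assumed values. */
int main(void)
{
    const double t1 = 1.0;   /* level-1 (cache) access time, arbitrary units */
    const double t2 = 10.0;  /* level-2 (main memory) access time            */

    for (double h = 0.0; h <= 1.0001; h += 0.1) {
        double t_avg = h * t1 + (1.0 - h) * (t1 + t2);
        printf("H = %.1f  ->  T_avg = %5.2f\n", h, t_avg);
    }
    return 0;
}
```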


Memory

The use of three levels exploits the fact that semiconductor memory comes in a variety of types which differ in speed and cost

Data are stored more permanently on external mass storage devices

External, nonvolatile memory is also referred to as secondary memory or auxiliary memory

Disk cache

A portion of main memory can be used as a buffer to temporarily hold data that is to be read out to disk

A few large transfers of data can be used instead of many small transfers of data

Data can be retrieved rapidly from the software cache rather than slowly from the disk


Figure 4.3 Cache and Main Memory
(a) Single cache: word transfers between the CPU and the cache are fast; block transfers between the cache and main memory are slower.
(b) Three-level cache organization: CPU <-> L1 cache (fastest) <-> L2 cache (fast) <-> L3 cache (less fast) <-> main memory (slow).

Cache operation – overview

CPU requests contents of a memory location
Check cache for this data
  If present, get from cache (fast)
  If not present, read the required block from main memory into the cache, then deliver from cache to CPU
Cache includes tags to identify which block of main memory is in each cache slot

Figure 4.4 Cache/Main-Memory Structure
(a) Cache: C lines, numbered 0 to C - 1, each holding a tag and a block of K words.
(b) Main memory: 2^n addressable words (addresses 0 to 2^n - 1), viewed as M = 2^n / K fixed-length blocks of K words each (Block 0 to Block M - 1).

Figure 4.5 Cache Read Operation

START: receive read address (RA) from CPU
Is the block containing RA in the cache?
  Yes: fetch the RA word and deliver it to the CPU -> DONE
  No: access main memory for the block containing RA; allocate a cache line for the main memory block; load the main memory block into the cache line; deliver the RA word to the CPU -> DONE
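
The flow of Figure 4.5 can be made concrete in code. Below is a minimal C sketch of a read from a direct-mapped cache, using the running example introduced later in these slides (16K lines of 4-byte blocks, 24-bit addresses); the structure and names are illustrative, not the textbook's:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_LINES  (1 << 14)   /* 16K cache lines   */
#define BLOCK_SIZE 4           /* 4 bytes per block */

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct line cache[NUM_LINES];
static uint8_t main_memory[1 << 24];   /* 16 MB, 24-bit address space */

/* Figure 4.5 as code: check the cache; on a miss, load the block
 * containing RA from main memory into a line, then deliver the word. */
static uint8_t cache_read(uint32_t ra)
{
    uint32_t block = ra / BLOCK_SIZE;      /* block containing RA */
    uint32_t index = block % NUM_LINES;    /* i = j mod m         */
    uint32_t tag   = block / NUM_LINES;    /* remaining high bits */
    struct line *l = &cache[index];

    if (!l->valid || l->tag != tag) {      /* miss: fetch the block */
        memcpy(l->data, &main_memory[block * BLOCK_SIZE], BLOCK_SIZE);
        l->tag   = tag;
        l->valid = true;
    }
    return l->data[ra % BLOCK_SIZE];       /* deliver the RA word */
}

int main(void)
{
    main_memory[0x001777] = 0x65;                  /* plant a value       */
    return cache_read(0x001777) == 0x65 ? 0 : 1;   /* miss, fill, deliver */
}
```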

Cache Design Parameters

Size

Mapping Function

Replacement Algorithm

Write Policy

Block Size

Number of Caches

Size does matter

Cost

More cache is expensive

Speed

More cache is faster (up to a point)

Cf. the larger the cache, the larger the number of gates involved in addressing the cache; the result is that large caches tend to be slightly slower than small ones.

Cache Addressing

Where does the cache sit?
  Between the processor and the virtual memory management unit (MMU)
  Between the MMU and main memory

Logical cache (virtual cache)
  Stores data using virtual addresses
  The processor accesses the cache directly, without going through the MMU
  Cf. a physical cache stores data using main memory physical addresses

Advantage of the logical cache: cache access is faster than for a physical cache, since the cache can respond before the MMU performs its address translation.

Disadvantage: most virtual memory systems supply each application with the same virtual memory address space.
  That is, each application sees a virtual memory that starts at address 0.
  Thus, the same virtual address in two different applications refers to two different physical addresses.
  The cache memory must therefore be completely flushed on each application context switch, or extra bits must be added to each line of the cache to identify which virtual address space the address refers to!

Cache Addressing (diagrams)

< Logical cache >  processor -> (logical address) -> cache -> MMU -> (physical address) -> main memory; the cache is indexed with the logical address
< Physical cache > processor -> (logical address) -> MMU -> (physical address) -> cache -> main memory; the cache is indexed with the physical address

Figure 4.6 Typical Cache Organization
(The cache connects to the processor via address, control, and data lines, and to the system bus through an address buffer and a data buffer.)

Table 4.2 Elements of Cache Design

Cache Addresses
  Logical
  Physical
Cache Size
Mapping Function
  Direct
  Associative
  Set associative
Replacement Algorithm
  Least recently used (LRU)
  First in first out (FIFO)
  Least frequently used (LFU)
  Random
Write Policy
  Write through
  Write back
Line Size
Number of Caches
  Single or two level
  Unified or split

Cache Addresses

Virtual memory
  Facility that allows programs to address memory from a logical point of view, without regard to the amount of main memory physically available
  When used, the address fields of machine instructions contain virtual addresses
  For reads from and writes to main memory, a hardware memory management unit (MMU) translates each virtual address into a physical address in main memory

Figure 4.7 Logical and Physical Caches
(a) Logical cache: processor -> cache (logical address) -> MMU -> main memory (physical address).
(b) Physical cache: processor -> MMU (logical address) -> cache -> main memory (physical address).

Table 4.3 Cache Sizes of Some Processors
(Table can be found on page 134 in the textbook.)

Processor             Type                   Year  L1 Cache (a)       L2 Cache         L3 Cache
IBM 360/85            Mainframe              1968  16 to 32 kB        —                —
PDP-11/70             Minicomputer           1975  1 kB               —                —
VAX 11/780            Minicomputer           1978  16 kB              —                —
IBM 3033              Mainframe              1978  64 kB              —                —
IBM 3090              Mainframe              1985  128 to 256 kB      —                —
Intel 80486           PC                     1989  8 kB               —                —
Pentium               PC                     1993  8 kB/8 kB          256 to 512 kB    —
PowerPC 601           PC                     1993  32 kB              —                —
PowerPC 620           PC                     1996  32 kB/32 kB        —                —
PowerPC G4            PC/server              1999  32 kB/32 kB        256 kB to 1 MB   2 MB
IBM S/390 G6          Mainframe              1999  256 kB             8 MB             —
Pentium 4             PC/server              2000  8 kB/8 kB          256 kB           —
IBM SP                High-end server/       2000  64 kB/32 kB        8 MB             —
                      supercomputer
CRAY MTA (b)          Supercomputer          2000  8 kB               2 MB             —
Itanium               PC/server              2001  16 kB/16 kB        96 kB            4 MB
Itanium 2             PC/server              2002  32 kB              256 kB           6 MB
IBM POWER5            High-end server        2003  64 kB              1.9 MB           36 MB
CRAY XD-1             Supercomputer          2004  64 kB/64 kB        1 MB             —
IBM POWER6            PC/server              2007  64 kB/64 kB        4 MB             32 MB
IBM z10               Mainframe              2008  64 kB/128 kB       3 MB             24-48 MB
Intel Core i7 EE 990  Workstation/server     2011  6 x 32 kB/32 kB    1.5 MB           12 MB
IBM zEnterprise 196   Mainframe/server       2011  24 x 64 kB/128 kB  24 x 1.5 MB      24 MB L3, 192 MB L4

(a) Two values separated by a slash refer to instruction and data caches.
(b) Both caches are instruction only; no data caches.

Mapping Function

Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines

Three techniques can be used:

Direct

• The simplest technique

• Maps each block of main memory into only one possible cache line

Associative

• Permits each main memory block to be loaded into any line of the cache

• The cache control logic interprets a memory address simply as a Tag and a Word field

• To determine whether a block is in the cache, the cache control logic must simultaneously examine every line’s Tag for a match

Set Associative

• A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages


Mapping Function

Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines

Also, a means is needed for determining which main memory block currently occupies a cache line

Three techniques can be used:

Direct

Associative

Set associative

Mapping Function

For designing the cache, we will use the following specification:
  Cache of 64 kBytes
  Cache block of 4 bytes
    i.e., the cache is 16K (2^14) lines of 4 bytes
  16 MBytes main memory
    24-bit address (2^24 = 16M)

Direct Mapping

Each block of main memory maps to only one cache line

i.e. if a block is in cache, it must be in one specific place

Address is in two parts:
  Least significant w bits identify a unique word within a block of main memory
  Most significant s bits specify one of the 2^s blocks of main memory
  The cache logic interprets these s bits as a tag of s-r bits (most significant) and a line field of r bits

Direct Mapping Summary

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = 2^r = m
Size of tag = (s - r) bits

Direct Mapping Address Structure

| Tag (s-r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits |

24-bit address
  2-bit word identifier (4-byte block)
  22-bit block identifier
    8-bit tag (= 22 - 14)
    14-bit slot or line
No two blocks that map to the same line have the same tag field
Check contents of cache by finding the line and checking the tag
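
A minimal C sketch of how this 8/14/2 split is extracted from a 24-bit address (the field widths come from the running example above; the helper names are mine):

```c
#include <stdint.h>
#include <stdio.h>

/* Direct mapping, 24-bit address = tag(8) | line(14) | word(2). */
#define WORD_BITS 2
#define LINE_BITS 14

static uint32_t word_of(uint32_t a) { return a & ((1u << WORD_BITS) - 1); }
static uint32_t line_of(uint32_t a) { return (a >> WORD_BITS) & ((1u << LINE_BITS) - 1); }
static uint32_t tag_of (uint32_t a) { return a >> (WORD_BITS + LINE_BITS); }

int main(void)
{
    uint32_t a = 0x16339C;   /* an arbitrary 24-bit address */
    printf("tag=%02X line=%04X word=%X\n",
           (unsigned)tag_of(a), (unsigned)line_of(a), (unsigned)word_of(a));
    return 0;
}
```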

Direct Mapping from Cache to Main Memory

Direct Mapping Cache Line Table

The mapping is expressed as i = j modulo m
  i : cache line number
  j : main memory block number
  m : number of lines in the cache

Cache line (i)   Main memory blocks assigned
0                0, m, 2m, 3m, ..., 2^s - m
1                1, m+1, 2m+1, ..., 2^s - m + 1
...              ...
m - 1            m - 1, 2m - 1, 3m - 1, ..., 2^s - 1

Direct Mapping Cache Organization

Direct Mapping Example

i = j modulo m
  i : cache line number
  j : main memory block number
  m : number of lines in the cache

In the cache, a line's main memory address is tag (s-r) || line number (r):
  (1) In the 16K-line cache, tag 16h and line number 0001h => 16 || (0001 * 4) = main memory address 160004h
  (2) Tag FFh and line number 3FFEh => FF || (3FFE * 4) = main memory address FFFFF8h

Direct Mapping pros & cons

Simple

Inexpensive

Fixed location for a given block
  If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high

Victim Cache

An extension to a direct-mapped cache: a small, secondary, fully associative cache that stores blocks ejected from the main cache due to a capacity or conflict miss.
  Lower miss penalty
  Remembers what was discarded
    Already fetched
    Can be used again with little penalty
  Fully associative
  4 to 16 cache lines
  Sits between the direct-mapped L1 cache and the next memory level
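
A minimal sketch in C of the behavior a victim cache implies: probe the small fully associative buffer on an L1 miss, and deposit evicted blocks into it. The structure and names are illustrative, not a real design:

```c
#include <stdbool.h>
#include <stdint.h>

#define VICTIM_LINES 8   /* small and fully associative: 4 to 16 lines */

struct ventry { bool valid; uint32_t block; };  /* block number acts as the tag */
static struct ventry victim[VICTIM_LINES];
static unsigned victim_next;                    /* simple FIFO replacement      */

/* On an L1 miss, probe the victim buffer before going to the next level.
 * A hit here is a cheap recovery of a recently ejected block. */
bool victim_lookup(uint32_t block)
{
    for (int i = 0; i < VICTIM_LINES; i++)
        if (victim[i].valid && victim[i].block == block)
            return true;   /* a real design would swap it back into L1 */
    return false;
}

/* Called when L1 ejects a block: remember it instead of dropping it. */
void victim_insert(uint32_t block)
{
    victim[victim_next].valid = true;
    victim[victim_next].block = block;
    victim_next = (victim_next + 1) % VICTIM_LINES;
}

int main(void) { victim_insert(42); return victim_lookup(42) ? 0 : 1; }
```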

More on Direct Mapping

Since the mapping function of direct mapping is i = j mod m, we can view the direct-mapped cache as in the following figure (e.g., 4 = 12 mod 8)
  i : cache line number
  j : main memory block number
  m : number of lines in the cache

More on Direct Mapping: Cache Miss

In this example we assume:
  main memory address = tag (6 bits) + cache line index (9 bits)
  addresses are written in octal
  the word size is 8 bits

Initially, cache line 777 holds the word from main memory address 00777, with tag value "00". When the processor wants to read the data at memory address "01777":
  (1) First, find the index 777 in the cache
  (2) Compare the tag "01" of the address with the tag value in that cache line
  (3) Since the tag value in the cache is "00", a cache miss occurs
  (4) The processor accesses main memory, and the word "65" (stored at 01777) is fetched
  (5) The word "65" is also written into cache line 777, with the new tag value "01"

From step (5), the tag and data of the line are updated.

Associative Mapping

Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache
  The memory address is interpreted as tag and word
  The tag uniquely identifies a block of memory
  To determine whether a block is in the cache, the cache control logic must simultaneously examine every line's tag for a match
  Cache searching gets expensive

Associative Mapping from Cache to Main Memory

Fully Associative Cache Organization

Associative

Mapping Example

* 24bit sized main memory address : 16339C22bit sized tag value : 058CE7

* FFFFFC (in main memory) >> 2 bits3FFFFF (of tag value)

Tag 22 bitWord

2 bit

Associative Mapping Address Structure

  24-bit address
  22-bit tag stored with each 32-bit block of data
  Compare the tag field with every tag entry in the cache to check for a hit
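
A minimal C sketch of that lookup under the 22/2 split. Hardware examines all tags in parallel with one comparator per line; the loop below is the software analogue, which is what makes the scheme expensive in silicon. Names are mine:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 16384   /* 16K lines of 4 bytes = 64 kB, as before */

struct line { bool valid; uint32_t tag; uint8_t data[4]; };
static struct line cache[NUM_LINES];

/* Fully associative: the block may be in ANY line, so every line's
 * 22-bit tag must be examined for a match. */
bool assoc_lookup(uint32_t addr, uint8_t *out)
{
    uint32_t tag  = addr >> 2;    /* 22-bit tag          */
    uint32_t word = addr & 0x3;   /* 2-bit word-in-block */

    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            *out = cache[i].data[word];
            return true;          /* hit  */
        }
    }
    return false;                 /* miss */
}

int main(void) { uint8_t b; return assoc_lookup(0x16339C, &b) ? 1 : 0; }
```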


Associative Mapping Summary

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = undetermined
Size of tag = s bits
  The address contains no line-number field
  The line a block occupies is not determined by the address format

Associative Mapping Pros & Cons

Advantage

Flexible

Disadvantages

Cost

Complex circuit for simultaneous comparison


Set Associative Mapping

A compromise that exhibits the strengths of both the direct and associative approaches
  Cache is divided into a number of sets
  Each set contains a number of lines
  A given block maps to any line in a given set
    e.g., block B can be in any line of set i
  e.g., 2 lines per set
    2-way set associative mapping
    A given block can be in one of 2 lines, in only one set

Set Associative Mapping

Cache is divided into v sets of k lines each
  m = v x k, where m is the number of cache lines
i = j mod v, where
  i : cache set number
  j : main memory block number
  v : number of sets
A given block maps to any line in a given set
k-way set associative cache
  2-way and 4-way are common

Set Associative Mapping Example

m = 16 lines, v = 8 sets
k = 2 lines/set: 2-way set associative mapping
Assume 32 blocks in memory; i = j mod v

Set   Blocks
0     0, 8, 16, 24
1     1, 9, 17, 25
:     :
7     7, 15, 23, 31

Since each set of the cache has 2 lines, a memory block can be in either of the 2 lines of its set
  e.g., block 17 can be assigned to either line 0 or line 1 in set 1 (see the sketch below)
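
A tiny C sketch of this block-to-set assignment, using the numbers from the example above:

```c
#include <stdio.h>

/* i = j mod v: which set each main memory block maps to (v = 8 sets). */
int main(void)
{
    const unsigned v = 8;
    const unsigned blocks[] = { 0, 8, 16, 24, 1, 9, 17, 25 };

    for (unsigned n = 0; n < sizeof blocks / sizeof blocks[0]; n++)
        printf("block %2u -> set %u\n", blocks[n], blocks[n] % v);
    /* block 17 -> set 1: it may occupy either of that set's 2 lines */
    return 0;
}
```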

Set Associative Mapping Example

Assume a 13-bit set number
Block number in main memory is taken modulo 2^13 (0010 0000 0000 0000 = 2000h)
000000h, 002000h, 004000h, ..., 00A000h, 00C000h, ... map to the same set
  (all of these examples have the same values in the least significant 13 bits of the block number)

Mapping From Main Memory to Cache: v Associative

k-Way Set Associative Cache Organization

  (1) First, the set field determines which set the block can be in
  (2) Then, within that set, the tag value determines which line (if any) matches

Set Associative Mapping Address Structure

| Tag: 9 bits | Set: 13 bits | Word: 2 bits |

Use the set field to determine which cache set to look in
Compare the tag field to see if we have a hit

e.g., with a 24-bit address = tag (9) || set (13) || word (2):
  Address FFFFFCh = tag 1FF || set+word 7FFC -> set number 1FFF, data 12345678
  Address 00FFFCh = tag 001 || set+word 7FFC -> set number 1FFF, data 11223344
Both addresses fall in set 1FFF but carry different tags, so in a 2-way cache both can be resident at once.
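
A minimal C sketch of the 9/13/2 field split, checked against the two addresses above (helper names are mine):

```c
#include <stdint.h>
#include <stdio.h>

/* Set-associative mapping, 24-bit address = tag(9) | set(13) | word(2). */
static uint32_t word_of(uint32_t a) { return a & 0x3; }
static uint32_t set_of (uint32_t a) { return (a >> 2) & 0x1FFF; }
static uint32_t tag_of (uint32_t a) { return a >> 15; }

int main(void)
{
    uint32_t a1 = 0xFFFFFC, a2 = 0x00FFFC;
    printf("a1: tag=%03X set=%04X\n", (unsigned)tag_of(a1), (unsigned)set_of(a1));
    printf("a2: tag=%03X set=%04X\n", (unsigned)tag_of(a2), (unsigned)set_of(a2));
    /* Both print set 1FFF with tags 1FF and 001: same set, different
     * tags, so a 2-way set can hold both blocks simultaneously. */
    return 0;
}
```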

Two-Way Set Associative Mapping Example

(1) Set number: the low 15 bits of the address (set + word, bits [14:0]) shifted right by 2 give the set index (bits [14:2]):
      set+word 0x0000 -> set 0x0000
      set+word 0x0004 -> set 0x0001
      set+word 0x339C -> set 0x0CE7
      set+word 0x7FFC -> set 0x1FFF
      set+word 0x7FF8 -> set 0x1FFE
(2) Since the cache is 2-way set associative, each set has 2 cache lines
(3) Tag: the most significant 9 bits of the address (bits [23:15])

Set Associative Mapping Summary

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in set = k
Number of sets = v = 2^d
Number of lines in cache = kv = k x 2^d
Size of tag = (s - d) bits

Remarks

Why is the simultaneous comparison cheaper here, compared to associative mapping?
  The tag is much smaller
  Only the k tags within one set are compared

Relationship between set associative and the first two: they are the extreme cases of set associative (k lines per set, v sets)
  k = 1, v = m : direct mapping (1 line/set)
  k = m, v = 1 : associative mapping (one big set)

Direct and Set Associative Cache Performance Differences

Significant up to at least 64 kB for 2-way
The difference between 2-way and 4-way at 4 kB is much less than the difference from 4 kB to 8 kB
The complexity of the cache increases in proportion to the associativity, and in this case it would not be justifiable against increasing the cache size to 8 or even 16 kBytes
Above 32 kB, greater associativity gives no improvement


Replacement Algorithms (1)

Direct mapping
  When a new block is brought into the cache, one of the existing blocks must be replaced
  In direct mapping, the replacement algorithm has the following features:
    No choice
    Each block only maps to one line
    Replace that line

Replacement Algorithms (2)

Associative & set associative
  Hardware-implemented algorithm (for speed)
  Least recently used (LRU)
    e.g., in 2-way set associative: which of the 2 blocks is LRU? (see the sketch below)
  First in first out (FIFO)
    Replace the block that has been in the cache longest
  Least frequently used (LFU)
    Replace the block which has had the fewest hits
  Random
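
For the 2-way case, LRU is cheap to track: one bit per set records which line was referenced more recently, and the other line is the victim. A minimal C sketch of that idea (the single-bit scheme is a common 2-way LRU implementation; the code itself is illustrative):

```c
#include <stdint.h>

#define NUM_SETS 8192   /* v sets, 2 lines per set (2-way) */

/* mru[s] records which of set s's two lines (0 or 1) was used most
 * recently; the other line is the least recently used one. */
static uint8_t mru[NUM_SETS];

/* Call on every hit or fill so the bit keeps tracking recency. */
void touch(unsigned set, unsigned line) { mru[set] = (uint8_t)line; }

/* On a miss with both lines valid, evict the NOT-most-recently-used line. */
unsigned victim(unsigned set) { return mru[set] ^ 1u; }

int main(void)
{
    touch(5, 0);        /* line 0 of set 5 referenced        */
    touch(5, 1);        /* then line 1                       */
    return victim(5);   /* -> 0: line 0 is now the LRU block */
}
```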

Write Policy

When a block that is resident in the cache is to be replaced, there are two cases to consider:
  If the old block in the cache has not been altered, it may be overwritten with a new block without first writing out the old block
  If at least one write operation has been performed on a word in that line of the cache, then main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block

There are two problems to contend with:
  More than one device may have access to main memory
  A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own local cache: if a word is altered in one cache, it could conceivably invalidate a word in other caches

Write Through and Write Back

Write through
  Simplest technique
  All write operations are made to main memory as well as to the cache
  The main disadvantage of this technique is that it generates substantial memory traffic and may create a bottleneck

Write back
  Minimizes memory writes
  Updates are made only in the cache
  Portions of main memory are invalid, and hence accesses by I/O modules can be allowed only through the cache
  This makes for complex circuitry and a potential bottleneck
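
A minimal C sketch contrasting the two policies on a single cache line; the dirty flag is the standard write-back bookkeeping, but the structure and names here are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[4]; };
static uint8_t main_memory[1 << 24];

/* Write through: update the cache line AND main memory on every store.
 * Simple, but each store costs a memory write (traffic, bottleneck). */
void write_through(struct line *l, uint32_t addr, uint8_t value)
{
    l->data[addr & 0x3] = value;
    main_memory[addr]   = value;
}

/* Write back: update only the cache and mark the line dirty;
 * main memory is brought up to date later, at eviction time. */
void write_back(struct line *l, uint32_t addr, uint8_t value)
{
    l->data[addr & 0x3] = value;
    l->dirty = true;
}

/* On replacement, a dirty line must be flushed before it is reused. */
void evict(struct line *l, uint32_t block_addr)
{
    if (l->valid && l->dirty)
        memcpy(&main_memory[block_addr], l->data, sizeof l->data);
    l->valid = l->dirty = false;
}

int main(void) { return 0; }
```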

Line Size

When a block of data is retrieved and placed in the cache, not only the desired word but also some number of adjacent words are retrieved

As the block size increases, the hit ratio will at first increase because of the principle of locality
  As the block size increases, more useful data are brought into the cache

The hit ratio will begin to decrease as the block becomes bigger, and the probability of using the newly fetched information becomes less than the probability of reusing the information that has to be replaced

Two specific effects come into play:
  Larger blocks reduce the number of blocks that fit into a cache
  As a block becomes larger, each additional word is farther from the requested word

Figure 4.17 Total Hit Ratio (L1 and L2) for 8-Kbyte and 16-Kbyte L1
(The figure plots total hit ratio, from about 0.78 to 0.98, against L2 cache size from 1 kB to 2 MB, with one curve for L1 = 8 kB and one for L1 = 16 kB.)

The figure shows the impact of L2 on total hits with respect to L1 size. L2 has little effect on the total number of cache hits until it is at least double the L1 cache size. Note that the steepest part of the slope for an L1 cache of 8 Kbytes is for an L2 cache of 16 Kbytes. Again, for an L1 cache of 16 Kbytes, the steepest part of the curve is for an L2 cache size of 32 Kbytes. Prior to that point, the L2 cache has little, if any, impact on total cache performance (an L2 smaller than the L1 is essentially meaningless). The need for the L2 cache to be larger than the L1 cache to affect performance makes sense.

Multilevel Caches

A modern CPU has an on-chip cache (L1) that increases overall performance
  e.g., 80486: 8 KB; Pentium: 16 KB; PowerPC: up to 64 KB
A secondary, off-chip cache (L2) provides high-speed access to main memory
  Generally 512 KB or less
Current processors include the L2 cache on the processor chip

Multilevel Caches

High logic density enables caches on chip

Faster than bus access

Frees bus for other transfers

Common to use both on and off chip cache

L1 on chip, L2 off chip in static RAM

L2 access much faster than DRAM or ROM

L2 often uses separate data path

L2 may now be on chip
  Resulting in an L3 cache
  Accessed via the bus, or now also on chip…

Unified vs. Split Caches

Unified cache

Stores data and instructions in one cache

Flexible; can balance the load between data and instruction fetches, giving a higher hit ratio

Only one cache to design and implement

Split cache

Two caches, one for data and one for instructions

Trend toward split cache

Good for superscalar machines that support parallel execution, prefetch, and pipelining

Overcome cache contention


Table 4.4 Intel Cache Evolution
(Table is on page 150 in the textbook.)

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
Processor on which feature first appears: 386

Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
First appears: 486

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
First appears: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
Solution: Create separate data and instruction caches.
First appears: Pentium

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solutions:
  Create a separate back-side bus (BSB) that runs at higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache. (Pentium Pro)
  Move the L2 cache onto the processor chip. (Pentium II)
  Add an external L3 cache. (Pentium III)

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
Solution: Move the L3 cache on-chip.
First appears: Pentium 4

Figure 4.18 Pentium 4 Block Diagram
(Key elements: an instruction fetch/decode unit feeding the L1 instruction cache (12K µops); out-of-order execution logic; load and store address units; integer and FP register files; simple and complex integer ALUs; FP/MMX and FP move units; the L1 data cache (16 KB); the L2 cache (512 KB, 256-bit path); the L3 cache (1 MB); and a 64-bit system bus.)

Table 4.5 Pentium 4 Cache Operating Modes

Control Bits    Operating Mode
CD   NW         Cache Fills   Write Throughs   Invalidates
0    0          Enabled       Enabled          Enabled
1    0          Disabled      Enabled          Enabled
1    1          Disabled      Disabled         Disabled

Note: CD = 0, NW = 1 is an invalid combination.

Summary: Chapter 4, Cache Memory

Computer memory system overview
  Characteristics of memory systems
  Memory hierarchy
Cache memory principles
Pentium 4 cache organization
Elements of cache design
  Cache addresses
  Cache size
  Mapping function
  Replacement algorithms
  Write policy
  Line size
  Number of caches
