QUIZ Ch.5
c. If an application needs to extend the representation size of integers from 32 to 64 bits,
which endianness is more appropriate?
Answer:
Little endian, since no bytes need to be shifted.
2
QUIZ Ch.5
Hint: Use Example 5.7 from text.
[Slides 3–6: worked examples; figures not preserved]
For stack architecture, convert to postfix first!
7
Chapter 6 Memory
9
6.3 The Memory Hierarchy
10
This chapter is about the “system” memory:
registers, cache, main memory.
Virtual memory is studied in the OS course:
• VM is typically implemented using a hard drive; it
extends the address space from RAM to the hard
drive.
• VM provides more space (compared to MM),
whereas cache memory provides more speed
6.3 The Memory Hierarchy
Simplified diagram of
computer organization
11 Partially based on “CS Illuminated” by Dale and Lewis
[Diagram: CPU and memory connected to storage and I/O devices: HDD/SSD, CD/DVD drive, tape unit, USB drive, external HDD, input/display controllers]
12
• If needed data is in a register, CPU simply uses it locally
• Else CPU sends request to cache
– If data is in cache (a.k.a. cache hit), it is brought to
CPU over the BSB
– Else (a.k.a. cache miss) the main memory is queried
• If data is in MM, it is brought to CPU over the FSB
• Else the request goes to disk …
Once the data is located on level k+1 (e.g., MM), the data
and a number of its nearby data elements are brought into
level k (e.g., the cache).
How the Memory Hierarchy Works
13
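A minimal Python sketch of this lookup chain (not from the text; the level names, sizes, and data are made up). Each level is modeled as a tiny block store, ordered fastest to slowest, and a hit at a slower level copies the whole block into all faster levels:

def read(address, levels, block_size=4):
    block = address // block_size            # which MM block holds the address
    for k, level in enumerate(levels):
        if block in level:                   # hit at level k
            for faster in levels[:k]:        # copy block into faster levels
                faster[block] = level[block]
            return level[block]
    raise KeyError("address not backed by any level")

cache, mm = {}, {}
disk = {b: "data%d" % b for b in range(1024)}
print(read(40, [cache, mm, disk]))   # miss in cache and MM: served from disk
print(read(41, [cache, mm, disk]))   # same block (10): now a cache hit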
• A hit is when data is found at a given memory level.
• A miss is when it is not found.
• The hit rate is the percentage of time data is found at a given memory level.
• The miss rate is the percentage of time it is not: miss rate = 1 - hit rate.
• The hit time is the time required to access data at a given memory level.
• The miss penalty is the time required to process a miss: the time it takes to replace a block in cache plus the time it takes to deliver the data to the CPU.
Memory Hierarchy Definitions
14
• An entire block of data is copied after a miss
because the principle of locality tells us that once an
element is accessed, it is likely that a nearby data
element will be needed soon.
• Three forms of locality:
– Temporal locality- Recently-accessed data elements
tend to be accessed again.
– Spatial locality - Accesses tend to cluster.
– Sequential locality - Instructions tend to be accessed
sequentially.
Locality
15
6.2 Types of Memory
There are two kinds of main/cache memory:
• random access memory (RAM), a.k.a. R/W
memory
• read-only memory (ROM)
Why do we need ROM? Isn’t RAM better?
A: RAM is volatile, ROM is non-volatile!
18
Types of RAM
DRAM:
– Consists of capacitors that slowly leak their
charge over time. Thus, a refresh has to be
performed every few milliseconds to prevent data
loss.
– Owing to its simple design (although it needs
extra circuitry for refresh):
• It is cheap
• It is dense (small footprint)
• It uses little power
19
Types of RAM
SRAM:
– Consists of transistor circuits (similar to D flip-flop)
– Owing to its more complex design:
• It is more expensive than DRAM
• It is less dense (larger footprint) than DRAM
• It uses more power than DRAM
– Owing to its transistor switches:
• It is faster than DRAM → used to build caches
• It doesn’t need to be refreshed
20
Types of ROM
• ROM
• PROM
• EPROM
• EEPROM
• Flash
For next time:
Read pp.342-3 of text
and write a short
description of each
type of ROM in
notebook!
22
6.4 Cache Memory
• The purpose of cache memory is to speed up
accesses by storing recently used data closer to the
CPU (instead of MM).
• Although cache is much smaller than main memory,
its access time is a fraction of that of main memory.
How much faster?
23
Sources: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/ (linked on webpage);
http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory
[Latency figures not preserved. Caption: Most advanced CPU in the x86 family currently used in commercial desktop PCs (Nehalem microarchitecture)]
25
6.4 Cache Memory
Unlike main memory, which is accessed by address,
cache is typically accessed by content; hence, it is
often called content addressable memory.
Because of this, cache memory does not scale well.
A single large cache memory takes longer to
search!
Source: http://en.wikipedia.org/wiki/Westmere_(microarchitecture)
27
Both cache and MM are divided into blocks of the
same size (e.g. 1 KB)
Problem: Cache address cannot be the same as MM
address (Why?)
The correspondence between cache blocks and MM
blocks is made by a cache mapping algorithm
(a.k.a. scheme):
– Direct
– Fully associative
– Set associative
How to find data in the cache?
28
• If the cache has N blocks, block X of MM maps to
cache block Y = X mod N.
• Modular arithmetic practice:
8 mod 3 =
42 mod 17 =
5 mod 8 =
• Example: If we have 10 blocks of cache, block 7 of
cache may hold blocks 7, 17, 27, 37, . . . of MM.
Direct mapped cache
29
• If the cache has N blocks, block X of MM maps to
cache block Y = X mod N.
• Modular arithmetic in binary, when N = 2^k, is trivial!
0110 1101 mod 100 =
1101 1110 mod 100 =
1110 1101 0010 mod 1000 =
Direct mapped cache
30
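A quick check of this trick in Python (illustrative, not from the text): X mod 2^k simply keeps the k lowest bits of X, which is why the practice lines above can be read off directly.

x = 0b01101101
print(x % 0b100  == x & 0b011)   # True: mod 4 keeps the low 2 bits
print(x % 0b1000 == x & 0b111)   # True: mod 8 keeps the low 3 bits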
Direct mapped cache
Block X of main
memory maps to
cache block
Y = X mod N.
QUIZ: Which
cache block will
MM block 123₁₀ be
mapped to?
32
Direct mapped cache
Since multiple MM
blocks “compete”
for the same cache
block, how do we
know which one is
actually in the
cache at a given
time?
33
Answer: We tag the cache block
with additional info
34
Consider a word-addressable MM consisting of 4
blocks, and a cache with 2 blocks.
Each block is 4 words (word length can be anything)
• Blocks 0 and 2 of MM map to Block 0 of cache
• Blocks 1 and 3 of MM map to Block 1 of cache.
Show these mappings
with arrows on the
diagram:
Direct mapped cache – Example 6.1
36
Word-addressable MM consisting of 4 blocks,
and a cache with 2 blocks.
Each block is 4 words
How many bits are needed for a MM address?
Direct mapped cache – ex. 6.1
37
Word-addressable MM consisting of 4 blocks, and a
cache with 2 blocks.
Each block is 4 words
For mapping, we split the MM address into 3 fields:
• Each block is 4 words, so the offset field must have 2 bits
• There are 2 blocks in cache, so the block field must contain 1
bit
• This leaves 1 bit for the tag
Direct mapped cache – ex. 6.1
38
Suppose we need to access
main memory address 3₁₆
(0011 in binary).
– Partitioning 0011 using the
address format from Figure a,
we get Figure b.
– Thus, the MM address 0011
maps to cache block 0.
– Figure c shows this mapping,
along with the tag that is also
stored with the data, in the
cache.
a
b
c
Direct mapped cache – ex. 6.1
39
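The partitioning above can be checked mechanically. A small sketch (the helper name is made up, not from the text) using Example 6.1's format: 2-bit offset, 1-bit block, 1-bit tag. The same helper answers the 0xE quiz below.

def split_direct(addr, offset_bits=2, block_bits=1):
    offset = addr & ((1 << offset_bits) - 1)
    block  = (addr >> offset_bits) & ((1 << block_bits) - 1)
    tag    = addr >> (offset_bits + block_bits)
    return tag, block, offset

print(split_direct(0b0011))   # (0, 0, 3): address 3 maps to cache block 0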
Direct mapped cache – ex. 6.1
40
QUIZ: Show the mapping for MM address 0xE
41
Direct mapped cache maps MM blocks in a modular
fashion to cache blocks. The mapping depends on:
• The number of bits in the main memory address
(how many addresses exist in main memory)
• The number of blocks in cache (which
determines the size of the block field)
• How many addresses (either bytes or words) are in
a block (which determines the size of the offset
field)
Direct mapped cache - review
To do for next time:
• Read pp. 342-3 of text and write a short
description of each type of ROM in
notebook!
• Read and understand examples 6.2 and 6.3
• Answer review questions 1- 10/390.
• Solve exercise 1/391
42
43
QUIZ: Types of ROM
• ROM
• PROM
• EPROM
• EEPROM
• Flash
Describe in one phrase
the difference between:
• ROM and PROM
• PROM and EPROM
• EPROM and EEPROM
• EEPROM and Flash
44
• ROM
• PROM
• EPROM
• EEPROM
• Flash
Source: http://en.wikipedia.org/wiki/Flash_memory
Flash memory was developed from EEPROM.
There are two main types of flash, which are named
after the NAND and NOR logic gates, b/c the
internal characteristics of the individual flash
memory cells exhibit characteristics similar to those
of the corresponding gates.
Whereas EEPROM has to be completely erased
before being rewritten:
--NAND type flash may be written and read in
blocks (or pages) which are generally much smaller
than the entire device;
--NOR type allows a single machine word (byte)
to be written or read independently.
45
EXAMPLE 6.2 Assume a byte-addressable main
memory consists of 2^14 bytes, cache has 16 blocks,
and each block has 8 bytes.
• The number of memory blocks: 2^14 / 8 = 2^11 = 2K blocks
• Each MM address requires 14 bits, of which:
– The rightmost 3 bits are the offset field
– We need log2(16) = 4 bits to select a specific block in
cache, so the block field consists of the middle 4 bits.
– The remaining 7 bits make up the tag field.
46
EXAMPLE 6.2 Assume a byte-addressable main
memory consists of 2^14 bytes, cache has 16 blocks,
and each block has 8 bytes.
• If brought into cache, in which cache block will the byte
at address 0x1234 reside?
• What will the block’s tag be?
• What is the offset of the byte inside the block?
To do for next time:
Read and understand examples 6.3 and
6.4.
47
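One way to check all three questions, as a sketch (assumes the field widths derived above: 7-bit tag, 4-bit block, 3-bit offset):

addr = 0x1234                    # 4660, a valid 14-bit address
offset = addr & 0b111            # byte within the block: 4
block  = (addr >> 3) & 0b1111    # cache block: 6
tag    = addr >> 7               # tag: 36 = 0b0100100
print(tag, block, offset)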
QUIZ: Exercise 1/391
48
54
Conclusion on Direct Mapped cache:
Finding a block in the cache does not require any
searching: The block is either at the calculated
location or it is not.
Problem: Sometimes a block (the “victim block”) has
to be removed from the cache although there are
other unused blocks in the cache!
55
Solution: The opposite extreme:
Finding a block in the cache requires searching the
entire cache!
56
• Instead of placing memory blocks in specific
cache locations based on memory address, we
allow a block to go anywhere in the cache!
• In this way, cache would have to fill up before
any blocks are evicted.
• A memory address is partitioned into only two
fields: the tag and the offset:
Fully Associative Cache
57
• We have 14-bit MM addresses and a cache with 16
blocks, each block of size 8. The field format of a
memory reference is:
• When the cache is searched, all tags are searched
in parallel to retrieve the data quickly.
• This requires special, costly hardware.
Example 6.2 – fully associative
58
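In software we can mimic the parallel tag search with a dictionary lookup; a rough sketch only (real hardware compares all tags at once, and the 16-entry capacity/eviction handling is omitted here):

OFFSET_BITS = 3                   # 8-byte blocks, so 11-bit tags remain
cache = {}                        # tag -> block contents

def fa_read(addr, mm):
    tag = addr >> OFFSET_BITS
    if tag not in cache:          # miss: fetch the whole block from MM
        cache[tag] = mm[tag]
    return cache[tag][addr & ((1 << OFFSET_BITS) - 1)]

mm = {t: ["byte %d:%d" % (t, i) for i in range(8)] for t in range(2**11)}
print(fa_read(0x1234, mm))        # block with tag 582, byte offset 4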
• Each MM block can reside in any cache block, so
the entire cache needs to be searched.
• This can be done with parallel (a.k.a.
associative) algorithms, but they don’t scale
well.
• Moreover …
Fully Associative Cache
Downsides
59
• You will recall that direct mapped cache evicts a
block whenever another memory reference
maps to the same cache location.
• With fully associative cache, we have no such
mapping, thus we must devise an algorithm to
determine which block to evict from the cache.
• The block that is evicted is the victim block.
• There are a number of ways to pick a victim,
a.k.a. cache replacement policies (stay tuned).
Fully Associative Cache
Downsides
60
• Combines the ideas of direct mapped cache and fully
associative cache.
• An N-way set-associative cache mapping is like
direct mapped cache in that a memory reference
maps to a particular location in cache.
• Unlike direct mapped cache, the location is a set of
several cache blocks, similar to the fully associative
cache.
Conclusion: Instead of mapping anywhere in the
cache, a memory reference can map only to a
(sub)set of cache slots. The (sub)set is determined
with modular arithmetic.
Set-Associative Cache
61
The number of cache blocks per set varies according to
overall system design. Example:
Set-Associative Cache
– A 2-way set associative cache can
be conceptualized as shown below.
– Each set contains two different
memory blocks.
[Figure: logical view vs. linear view of a 2-way set-associative cache]
62
• A memory reference is divided into three
fields: tag, set, and offset.
• As with direct-mapped cache, the offset field
chooses the word within the cache block, and
the tag field uniquely identifies the memory
address.
• The set field determines the set to which the
memory block maps.
Set-Associative Cache
63
We are using 2-way set associative mapping with a word-
addressable main memory of 2^14 words and a cache
with 16 blocks, where each block contains 8 words.
– Cache has a total of 16 blocks, and each set has 2 blocks,
so there are 8 sets in cache.
– Thus, the set field is 3 bits, the offset field is 3 bits, and
the tag field is 8 bits.
EXAMPLE 6.5
64
We are using 2-way set associative mapping with a word-
addressable main memory of 2^14 words and a cache
with 16 blocks, where each block contains 8 words.
– How does the cache controller look for MM address
0x1234 ?
EXAMPLE 6.5
65
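A sketch of the field split the controller applies here (8-bit tag, 3-bit set, 3-bit offset, per the previous slide; illustration only):

addr = 0x1234
offset  = addr & 0b111           # word within the block: 4
set_idx = (addr >> 3) & 0b111    # set index: 6
tag     = addr >> 6              # tag: 72 = 0b01001000
# The controller then compares 'tag' against both blocks of set 6.
print(tag, set_idx, offset)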
• With fully associative and set associative cache, a
replacement policy is invoked when it becomes
necessary to evict a block from cache.
• An optimal replacement policy would be able to look
into the future to see which blocks won’t be needed
for the longest period of time.
• Although it is impossible to implement an optimal
replacement algorithm, it is instructive to use it as a
benchmark for assessing the efficiency of any other
scheme we come up with.
6.4.2. Replacement policies
66
• With fully associative and set associative cache, a
replacement policy is invoked when it becomes
necessary to evict a block from cache.
6.4.2. Replacement policies
Why don’t we need a R.P.
for direct-mapped cache?
67
The replacement policy that we choose depends upon
the locality that we are trying to optimize
– usually, we are interested in temporal locality.
Least recently used (LRU) algorithm keeps track of
the last time that a block was accessed and evicts
the block that has been unused for the longest
period of time.
Disadvantage (complexity): LRU has to maintain access
history for each block → slows down the cache.
Replacement policies
68
First-in, first-out (FIFO) is a popular cache
replacement policy.
• The block that has been in the cache the longest is
evicted, regardless of when it was last used.
Random replacement policy
• Picks a block at random and replaces it with a new
block.
• It can evict a block that will be needed often or
needed soon, but it never thrashes.
Replacement policies
The probability of thrashing is very small!
70
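A toy contrast of LRU and FIFO victim selection (all structure names are made up; a sketch, not how real controllers are built):

from collections import OrderedDict, deque

def lru_access(cache, capacity, block):
    if block in cache:
        cache.move_to_end(block)            # refresh the access history
    else:
        if len(cache) == capacity:
            cache.popitem(last=False)       # evict the least recently used
        cache[block] = True

def fifo_access(cache, queue, capacity, block):
    if block not in cache:
        if len(cache) == capacity:
            cache.discard(queue.popleft())  # evict the oldest arrival
        cache.add(block)
        queue.append(block)

lru, fifo, order = OrderedDict(), set(), deque()
for b in [1, 2, 3, 1, 4]:
    lru_access(lru, 3, b)
    fifo_access(fifo, order, 3, b)
print(list(lru), sorted(fifo))   # LRU keeps block 1; FIFO has evicted it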
• The performance of hierarchical memory is
measured by its effective access time (EAT).
• EAT is a weighted average that takes into account
the hit ratio and relative access times of successive
levels of memory.
• The EAT for a two-level memory is given by:
EAT = H · Access_C + (1 − H) · Access_MM
where H is the cache hit rate and Access_C and Access_MM are
the access times for cache and main memory, respectively.
6.4.3 Cache performance
measures
71
A computer system has a MM access time of 200 ns
supported by a cache having a 10 ns access time
and a hit rate of 99%.
• Suppose access to cache and main memory
occurs concurrently. (The accesses overlap.)
• The EAT is:
0.99(10 ns) + 0.01(200 ns) = 9.9 ns + 2 ns = 11.9 ns.
Cache performance example
72
A computer system has a MM access time of 200 ns
supported by a cache having a 10 ns access time
and a hit rate of 99%.
• What if the accesses do not overlap?
• The EAT is:
0.99(10 ns) + 0.01(10 ns + 200 ns)
= 9.9 ns + 2.1 ns = 12 ns.
Cache performance example
73
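The two cases differ only in whether the miss term includes the cache probe; a quick check of the arithmetic above (a sketch, not from the text):

def eat(h, t_cache, t_mm, overlapped=True):
    miss_time = t_mm if overlapped else t_cache + t_mm
    return h * t_cache + (1 - h) * miss_time

print(eat(0.99, 10, 200))                    # ~11.9 ns, overlapped
print(eat(0.99, 10, 200, overlapped=False))  # ~12.0 ns, sequential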
A computer system has a MM access time of 100 ns
supported by a cache having a 6 ns access time.
The cache and MM accesses are non-overlapped.
We want the EAT to be 10 ns or under.
What is the minimum hit rate?
QUIZ: Cache performance
74
To do for next time:
• Read pp. 354-367 of text
• Read and understand examples 6.3, 6.4, 6.6
• Answer review questions 11- 17/390.
• Solve exercises 4 and 7/392
75
76
A computer system has a MM access time of 150 ns
and a hit rate of 98%.
The cache and MM accesses do not overlap.
We want the EAT to be 10 ns or under.
• What does EAT stand for?
• What is the slowest cache we can use for this
system (i.e. the max cache access time)?
QUIZ: Cache performance
77
• EAT = Effective Access Time
• 0.98∙Tmax + 0.02∙(Tmax + 150 ns) = 10 ns
Tmax = 7 ns
QUIZ: Cache performance
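Since both the hit and miss branches include Tmax, it factors out; the same algebra as a two-line check (sketch):

# 0.98*T + 0.02*(T + 150) <= 10  simplifies to  T + 0.02*150 <= 10
t_max = 10 - 0.02 * 150
print(t_max)   # 7.0 ns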
QUIZ cache algorithms:
Exercise 14/363
78
In each case, show what the cache controller will do when the CPU places a
request for each address in the sequence A, A, C, D, F, F
81
Least recently used (LRU) algorithm keeps track of
the last time that a block was accessed and evicts
the block that has been unused for the longest
period of time.
Disadvantage (complexity): LRU has to maintain
access history for each block → slows down the
cache.
p.333 of text: “There are ways to approximate LRU, but
that is beyond the scope of this book.”
Remember from last time:
Cache replacement policies
83
The LRU bit is the simplest implementation of the LRU
algorithm.
Example: We have a 2-way set-associative cache:
Approximating LRU with one bit
Used in the Intel Pentium,
see Section 6.6
Not in text, but required!
84
When there is a hit (or when the block is first brought into
the cache):
• its LRU bit is reset (made 0)
• the LRU bit of the other block in the set is set (made 1)
Approximating LRU with one bit
Not in text, but required!
85
The cache has 3 sets, and it is initially empty.
The CPU is requesting the following sequence of blocks:
12, 1, 2, 1, 2, 42, 1, 2, 3, 4
QUIZ: LRU bit
Not in text, but required!
86
12, 1, 2, 1, 2, 42, 1, 2, 3, 4
QUIZ: LRU bit
Not in text, but required!
87
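A sketch of the one-LRU-bit bookkeeping for this quiz (3 sets, 2-way, block b maps to set b mod 3; structure names made up):

sets = [{} for _ in range(3)]              # each set: {block: lru_bit}

def access(block):
    s = sets[block % 3]
    if block not in s:
        if len(s) == 2:                    # set full: evict the block whose bit is 1
            victim = next(b for b, bit in s.items() if bit == 1)
            del s[victim]
        s[block] = 0
    for b in s:                            # accessed block gets 0, the other gets 1
        s[b] = 0 if b == block else 1

for b in [12, 1, 2, 1, 2, 42, 1, 2, 3, 4]:
    access(b)
print(sets)   # e.g. set 0 ends with {42: 1, 3: 0}: block 12 was the victim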
• Caching performance depends upon programs
exhibiting good locality.
– Some object-oriented programs have poor locality
owing to their complex, dynamic structures.
– Arrays stored in column-major rather than row-major
order can be problematic for certain cache
organizations.
• With poor locality, caching can actually cause
performance degradation rather than improvement!
6.4.4 When does caching
break down?
88
• Cache replacement policies must take into account
dirty blocks = blocks that have been updated while
they were in the cache.
• Dirty blocks must be written back to MM. A write policy
determines how/when this is done.
• There are two types of write policies:
– write through → Both cache and MM are
updated simultaneously on every write
– write back → MM is updated only when the
block is selected for replacement
Writing to the Cache
89
• write through
– Disadvantage: MM must be updated with each cache
write, which slows down the access time on updates.
• This slowdown is usually negligible, b/c the
majority of accesses are reads, not writes.
– Advantage: The MM always stays consistent with the
cache
• write back
– Advantage: MM traffic is minimized
– Disadvantage: A data value in MM is not always the
same as that value in the cache
• This may cause problems in systems with
concurrent users, esp. when each user has their
own cache.
The cache coherence problem [Source: http://en.wikipedia.org/wiki/Cache_coherence]
90
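A toy sketch of the two policies (hypothetical names; a real controller moves whole blocks, not single words):

mm = {0: "old"}                  # pretend main memory
cache = {}                       # addr -> (value, dirty_bit)

def write(addr, value, write_through=True):
    if write_through:
        cache[addr] = (value, False)
        mm[addr] = value         # MM updated on every write
    else:
        cache[addr] = (value, True)   # write back: mark dirty, defer MM

def evict(addr):
    value, dirty = cache.pop(addr)
    if dirty:                    # written back only at replacement time
        mm[addr] = value

write(0, "new", write_through=False)
print(mm[0])   # still "old": MM is stale (the coherence problem)
evict(0)
print(mm[0])   # "new" once the dirty block is written back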
Not in text
91
• The cache we have been discussing is called a
unified or integrated cache where both instructions
and data are cached.
• Many modern systems employ separate caches
for data and instructions.
– This is called a Harvard cache.
What to Cache?
92
Most programs:
– have higher instruction locality than data locality
– are not self-modifying (!)
It makes sense, then, to use separate caches,
each with its own size, mapping alg.,
replacement alg., write policy
Downside: greater complexity
– A larger unified cache provides about the same
performance improvement w/o introducing as
much complexity.
Why use separate caches?
93
• Cache performance can also be improved by adding
a small associative cache to hold blocks that have
been evicted recently.
– This is called a victim cache.
• A trace cache is a variant of the instruction cache:
– It holds decoded instructions for program branches, giving
the illusion that noncontiguous instructions are really
contiguous.
– Example: loops
– The Intel Pentium 4 used it!
What to Cache?
Remember from Chs.4, 5:
• microoperations
• pipelining
• branch prediction
95
Trace Caches in Intel CPUs
http://www.anandtech.com/print/604
P6 µarch, a.k.a. i686 (Pentium Pro, Pentium II, III, Celeron, M, Xeon) had
separate L1 caches and L2, but no trace cache.
NetBurst µarch, a.k.a. P7 (i786?) (Pentium 4, D, Xeon)
The L1 instruction cache becomes a trace cache of size 12K micro-ops,
called Execution Trace Cache.
Not in text, FYI only
Patented by Alex Peleg
and Uri Weiser of Intel
Corp. in 1994.
Commercially available
in 2000-2006, starting
with the Willamette
core for the Pentium 4.
96 Source: http://www.xbitlabs.com/articles/cpu/display/nehalem-microarchitecture_3.html (for example discussion)
Not in text, FYI only
Core µarch (Core 2, Xeon)
In addition to a classical L1 instruction cache, there is a trace
cache called Loop Stream Detector, which can hold 18
instructions. Detects only loops.
Nehalem µarch (Core i3, i5, i7)
The Loop Stream Detector is moved downstream of the
decoder, can hold 28 micro-ops. Still only loops.
97
Trace Cache in Intel CPUs
Source: http://www.behardware.com/articles/815-2/intel-core-i7-and-core-i5-lga-1155-sandy-bridge.html
Not in text, FYI only
Sandy Bridge µarch (second gen. Core i3, i5, i7)
Aside from the classical L1 instruction cache, there is an “L0” trace
cache holding 1.5K µops. Can store any branches, not just loops.
Most of today’s desktops and servers employ multilevel
cache hierarchies:
• L1 cache (8KB to 64KB) is situated on the processor
itself.
• L2 cache (64KB to 2MB) was initially on the
motherboard (part of the “chipset”) or on an
expansion card.
Multi-level Caches
Figure source:
http://www.karbosguide.com/hardware/module3b2.htm
99
Multi-level Caches
Figure source: http://www.elektronik-kompendium.de/sites/com/0309291.htm
Figure source: http://www.tomshardware.co.uk/athlon-l3-cache,review-31697-2.html
The trend today is for the L2 cache to
“migrate” onto the CPU chip, thereby having
two levels of cache on the chip itself.
The old, external cache was renamed L3 cache.
100
Once the number of cache levels is determined, the
next thing to consider is whether data (or
instructions) can exist in more than one cache level.
• Inclusive caches: the same data/instr. may be
present at multiple levels of cache.
– Strictly inclusive: all data/instr. in a smaller
cache (e.g. L1) must be present at the next
higher level (e.g. L2).
• Exclusive cache: permit only one copy of the data
(e.g. either in L1 or L2, not in both).
Multi-level Caches
101
6.5 Virtual Memory
• Cache memory enhances performance by providing
faster memory access speed.
• Virtual memory (VM) enhances performance by
providing greater memory capacity w/o adding more
MM.
• Instead, a portion of a disk drive serves as an
extension of MM.
• If a system uses paging, VM partitions the MM into
individually managed page frames that are written
(or paged out) to disk when they are not immediately
needed.
Similar to the cache
blocks
102
SKIP the rest of
6.5 Virtual Memory
Until …
103
• Another approach to virtual memory is the use of
segmentation.
• Instead of dividing memory into equal-sized pages,
virtual address space is divided into variable-length
segments, often under the control of the programmer.
• A segment is located through its entry in a segment
table, which contains the segment’s memory location
and a bounds limit that indicates its size.
• After a segment fault, the operating system searches for a
location in memory large enough to hold the segment
that is retrieved from disk.
6.5.5 Segmentation
104
Both paging and segmentation can cause
fragmentation:
• Paging is subject to internal fragmentation because
a process may not need the entire range of addresses
contained within the page. Thus, there may be many
pages containing unused fragments of memory.
• Segmentation is subject to external fragmentation,
which occurs when contiguous chunks of memory
become broken up as segments are allocated and
deallocated over time.
Segmentation
105
• Consider a small
computer having 32K of
memory, with
segmentation.
• The memory segments of
two processes are shown
in the table at the right.
• The segments can be
allocated anywhere in
memory.
Segmentation example
106
• All of the segments of P1 and one of
the segments of P2 are loaded as
shown at the right.
• Segment S2 of process P2 requires
11K of memory, and there is only 1K
free, so it waits.
Segmentation example
107
• Eventually, Segment 2 of Process 1
is no longer needed, so it is
unloaded, giving 11K of free memory.
• But Segment 2 of Process 2 cannot
be loaded because the free memory
is not contiguous.
Segmentation example
108
• Over time, the problem gets
worse, resulting in small
unusable blocks scattered
throughout physical memory.
• This is an example of external
fragmentation.
• Eventually, this memory is
recovered through compaction,
a.k.a. garbage collection and
the process starts over.
Segmentation example
109
• Large page tables are cumbersome and slow, but with
its uniform memory mapping, page operations are
fast. Segmentation allows fast access to the segment
table, but segment loading is labor-intensive.
• Paging and segmentation can be combined to take
advantage of the best features of both by assigning
fixed-size pages within variable-sized segments.
• Each segment has a page table. This means that a
memory address will have three fields:
– one for the segment
– a second one for the page (within the segment)
– a third for the offset (within the page)
6.5.6 Paging and segmentation
110
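A sketch of the three-field split with made-up widths (4-bit segment, 6-bit page, 10-bit offset; real widths vary by system):

SEG_BITS, PAGE_BITS, OFF_BITS = 4, 6, 10

def split_virtual(addr):
    offset  = addr & ((1 << OFF_BITS) - 1)
    page    = (addr >> OFF_BITS) & ((1 << PAGE_BITS) - 1)
    segment = addr >> (OFF_BITS + PAGE_BITS)
    return segment, page, offset   # each segment owns its own page table

print(split_virtual(0x12345))      # (1, 8, 837)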
6.6 Real-World Example
Early Intel Caches
As the x86 microprocessors reached clock rates of 20 MHz and above in
the Intel 386, small amounts of fast cache memory began to be
featured in systems to improve performance. This was because the
DRAM used for main memory had significant latency, up to 120 ns,
as well as refresh cycles.
• The cache was constructed from more expensive, but significantly
faster, SRAM, which at the time had latencies around 10 ns.
• The early caches were external to the processor and typically located
on the motherboard in the form of eight or nine DIP devices placed in
sockets to enable the cache as an optional extra or upgrade feature.
• Some versions could support 16 to 64 KB of external cache.
Source: http://en.wikipedia.org/wiki/CPU_cache#In_x86_microprocessors
Not in text
1985
111
6.6 Real-World Example
Early Intel Caches
With the Intel 486, an 8 KB cache was integrated directly into the CPU
die.
• This cache was termed Level 1 or L1 cache to differentiate it from the
slower on-motherboard, or Level 2 (L2) cache.
• These on-motherboard caches were much larger, with the most
common size then being 256 KB.
Source: http://en.wikipedia.org/wiki/CPU_cache#In_x86_microprocessors
Not in text
1989
112
6.6 Real-World Example
Pentium (P5) Caches
The P5 micro-architecture supports both paging and segmentation,
which can be used in various combinations (unpaged unsegmented,
segmented unpaged, unsegmented paged).
Two levels of cache (L1 and L2), both having block size of 32 B, and
using 2-way set-associative mapping.
• L1 cache:
– On the CPU chip
– Has two parts: instruction cache (I-cache) and a data cache (D-
cache).
• L2:
– External
– 512 KB to 1 MB
1993
113
6.6 Real-World Example
Pentium (P5) Caches
114
Not in text
Today’s processors have L1, L2, and L3
built on the die (chip) itself!
Source: http://www.extremetech.com/computing/77348-amd-unveils-barcelona-quadcore-details
115
Not in text
Latest development: Package eDRAM chip next to CPU!
Sources: http://forum.hwbot.org/showthread.php?t=73204&page=2
http://semimd.com/chipworks/2014/02/07/intels-e-dram-shows-up-in-the-wild/
http://www.cinemablend.com/games/Wii-U-Memory-Bandwidth-GPU-More-Powerful-Than-We-Thought-62437.html
116
Intel’s latest
micro-
architecture:
Haswell
Not in text
• Micro-operation cache capable of
storing 1.5 K micro-ops (~6 KB)
produced by the decoders
• 14- to 19-stage instruction
pipeline, depending on the micro-
op cache hit or miss
Sources: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
http://wccftech.com/article/exploring-amds-intels-architectural-philosophies-future-hold-part/
117
• Memory hierarchy → smallest, fastest memory at
the top, largest, slowest memory at the bottom.
• Cache memory gives faster access to MM
• VM uses disk storage to give the illusion of having a
large MM.
• Cache maps blocks of main memory to blocks of
cache memory.
• VM maps virtual pages to page frames.
• There are three general cache mapping schemes: direct
mapped, fully associative, and set associative.
Chapter 6 REVIEW
118
• With fully associative and set associative cache,
replacement policies must be established. E.g.,
LRU, FIFO, or LFU.
• Write policies → what to do with “dirty” (updated)
blocks.
• Segmentation → segments are variable-sized
units assigned to processes.
• Fragmentation:
– internal for paged memory
– external for segmented memory.
Chapter 6 REVIEW
119
Exercises 3, 5, 10
Due Tuesday lab, Dec.2 before review
Chapter 6 HOMEWORK
120
A byte-addressable memory uses 15 bits for addresses.
The cache has 64 blocks, of size 8 Byte each.
Show how the address is to be split into fields for a
direct-mapped cache.
Example 6.4
121
A byte-addressable memory uses 15 bits for addresses.
The cache has 64 blocks, of size 8 Byte each.
Explain what the cache controller does when the MM
address ABC7 is needed by the CPU.
Example 6.4
122
A byte-addressable memory uses 15 bits for addresses.
The cache has 64 blocks, of size 8 Byte each.
Is the figure on p.328 of the text wrong?
Address 1028 → 000010 000000 100
Example 6.4
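A quick check (sketch): with 64 blocks the block field is 6 bits, 8-byte blocks give a 3-bit offset, and a 15-bit address leaves a 6-bit tag.

addr = 1028                        # 000010 000000 100 in 15 bits
offset = addr & 0b111              # 100 -> 4
block  = (addr >> 3) & 0b111111    # 000000 -> 0
tag    = addr >> 9                 # 000010 -> 2
print(tag, block, offset)          # matches the split above: the figure is right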