QUIZ Ch.5
c. If an application needs to extend the representation size of integers from 32 to 64 bits,
which endianness is more appropriate?
Answer:
Little endian, since no bytes need to be shifted.
2
QUIZ Ch.5
Hint: Use Example 5.7 from text.
[Slides 3–6: worked examples; figures not preserved]
For stack architecture, convert to postfix first!
7
Chapter 6 Memory
9
6.3 The Memory Hierarchy
10
This chapter is about the “system” memory:
registers, cache, main memory.
Virtual memory is studied in the OS course:
• VM is typically implemented using a hard drive; it
extends the address space from RAM to the hard
drive.
• VM provides more space (compared to MM),
whereas cache memory provides more speed
6.3 The Memory Hierarchy
Simplified diagram of
computer organization
11 Partially based on “CS Illuminated” by Dale and Lewis
[Diagram: CPU and memory connected to storage and I/O devices: HDD/SSD, CD/DVD drive, tape unit, USB drive, external HDD, input/display controllers]
12
• If needed data is in a register, CPU simply uses it locally
• Else CPU sends request to cache
– If data is in cache (a.k.a. cache hit), it is brought to
CPU over the BSB
– Else (a.k.a. cache miss) the main memory is queried
• If data is in MM, it is brought to CPU over the FSB
• Else the request goes to disk …
Once the data is located on level k+1 (e.g., MM), the data
and a number of its nearby data elements are brought into
level k (e.g., the cache).
How the Memory Hierarchy Works
13
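A minimal Python sketch of this lookup chain (not from the text; the level names, sizes, and data are made up). Each level is modeled as a tiny block store, ordered fastest to slowest, and a hit at a slower level copies the whole block into all faster levels:

def read(address, levels, block_size=4):
    block = address // block_size            # which MM block holds the address
    for k, level in enumerate(levels):
        if block in level:                   # hit at level k
            for faster in levels[:k]:        # copy block into faster levels
                faster[block] = level[block]
            return level[block]
    raise KeyError("address not backed by any level")

cache, mm = {}, {}
disk = {b: "data%d" % b for b in range(1024)}
print(read(40, [cache, mm, disk]))   # miss in cache and MM: served from disk
print(read(41, [cache, mm, disk]))   # same block (10): now a cache hit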
• A hit is when data is found at a given memory level.
• A miss is when it is not found.
• The hit rate is the percentage of time data is found at a given memory level.
• The miss rate is the percentage of time it is not: miss rate = 1 - hit rate.
• The hit time is the time required to access data at a given memory level.
• The miss penalty is the time required to process a miss: the time it takes to replace a block in cache plus the time it takes to deliver the data to the CPU.
Memory Hierarchy Definitions
14
• An entire block of data is copied after a miss
because the principle of locality tells us that once an
element is accessed, it is likely that a nearby data
element will be needed soon.
• Three forms of locality:
– Temporal locality- Recently-accessed data elements
tend to be accessed again.
– Spatial locality - Accesses tend to cluster.
– Sequential locality - Instructions tend to be accessed
sequentially.
Locality
15
6.2 Types of Memory
There are two kinds of main/cache memory:
• random access memory (RAM), a.k.a. R/W
memory
• read-only memory (ROM)
Why do we need ROM? Isn’t RAM better?
A: RAM is volatile, ROM is non-volatile!
18
Types of RAM
DRAM:
– Consists of capacitors that slowly leak their
charge over time. Thus, a refresh has to be
performed every few milliseconds to prevent data
loss.
– Owing to its simple design (although it needs
extra circuitry for refresh):
• It is cheap
• It is dense (small footprint)
• It uses little power
19
Types of RAM
SRAM:
– Consists of transistor circuits (similar to D flip-flop)
– Owing to its more complex design:
• It is more expensive than DRAM
• It is less dense (larger footprint) than DRAM
• It uses more power than DRAM
– Owing to its transistor switches:
• It is faster than DRAM → used to build caches
• It doesn’t need to be refreshed
20
Types of ROM
• ROM
• PROM
• EPROM
• EEPROM
• Flash
For next time:
Read pp.342-3 of text
and write a short
description of each
type of ROM in
notebook!
22
6.4 Cache Memory
• The purpose of cache memory is to speed up
accesses by storing recently used data closer to the
CPU (instead of MM).
• Although cache is much smaller than main memory,
its access time is a fraction of that of main memory.
How much faster?
23
Sources: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/ (linked on webpage);
http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory
[Latency figures not preserved. Caption: Most advanced CPU in the x86 family currently used in commercial desktop PCs (Nehalem microarchitecture)]
25
6.4 Cache Memory
Unlike main memory, which is accessed by address,
cache is typically accessed by content; hence, it is
often called content addressable memory.
Because of this, cache memory does not scale well.
A single large cache memory takes longer to
search!
Source: http://en.wikipedia.org/wiki/Westmere_(microarchitecture)
27
Both cache and MM are divided into blocks of the
same size (e.g. 1 KB)
Problem: Cache address cannot be the same as MM
address (Why?)
The correspondence between cache blocks and MM
blocks is made by a cache mapping algorithm
(a.k.a. scheme):
– Direct
– Fully associative
– Set associative
How to find data in the cache?
28
• If the cache has N blocks, block X of MM maps to
cache block Y = X mod N.
• Modular arithmetic practice:
8 mod 3 =
42 mod 17 =
5 mod 8 =
• Example: If we have 10 blocks of cache, block 7 of
cache may hold blocks 7, 17, 27, 37, . . . of MM.
Direct mapped cache
29
• If the cache has N blocks, block X of MM maps to
cache block Y = X mod N.
• Modular arithmetic in binary, when N = 2^k, is trivial!
0110 1101 mod 100 =
1101 1110 mod 100 =
1110 1101 0010 mod 1000 =
Direct mapped cache
30
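A quick check of this trick in Python (illustrative, not from the text): X mod 2^k simply keeps the k lowest bits of X, which is why the practice lines above can be read off directly.

x = 0b01101101
print(x % 0b100  == x & 0b011)   # True: mod 4 keeps the low 2 bits
print(x % 0b1000 == x & 0b111)   # True: mod 8 keeps the low 3 bits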
Direct mapped cache
Block X of main
memory maps to
cache block
Y = X mod N.
QUIZ: Which
cache block will
MM block 123₁₀ be
mapped to?
32
Direct mapped cache
Since multiple MM
blocks “compete”
for the same cache
block, how do we
know which one is
actually in the
cache at a given
time?
33
Answer: We tag the cache block
with additional info
34
Consider a word-addressable MM consisting of 4
blocks, and a cache with 2 blocks.
Each block is 4 words (word length can be anything)
• Blocks 0 and 2 of MM map to Block 0 of cache
• Blocks 1 and 3 of MM map to Block 1 of cache.
Show these mappings
with arrows on the
diagram:
Direct mapped cache – Example 6.1
36
Word-addressable MM consisting of 4 blocks,
and a cache with 2 blocks.
Each block is 4 words
How many bits are needed for a MM address?
Direct mapped cache – ex. 6.1
37
Word-addressable MM consisting of 4 blocks, and a
cache with 2 blocks.
Each block is 4 words
For mapping, we split the MM address into 3 fields:
• Each block is 4 words, so the offset field must have 2 bits
• There are 2 blocks in cache, so the block field must contain 1
bit
• This leaves 1 bit for the tag
Direct mapped cache – ex. 6.1
38
Suppose we need to access
main memory address 3₁₆
(0011 in binary).
– Partitioning 0011 using the
address format from Figure a,
we get Figure b.
– Thus, the MM address 0011
maps to cache block 0.
– Figure c shows this mapping,
along with the tag that is also
stored with the data, in the
cache.
a
b
c
Direct mapped cache – ex. 6.1
39
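The partitioning above can be checked mechanically. A small sketch (the helper name is made up, not from the text) using Example 6.1's format: 2-bit offset, 1-bit block, 1-bit tag. The same helper answers the 0xE quiz below.

def split_direct(addr, offset_bits=2, block_bits=1):
    offset = addr & ((1 << offset_bits) - 1)
    block  = (addr >> offset_bits) & ((1 << block_bits) - 1)
    tag    = addr >> (offset_bits + block_bits)
    return tag, block, offset

print(split_direct(0b0011))   # (0, 0, 3): address 3 maps to cache block 0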
Direct mapped cache – ex. 6.1
40
QUIZ: Show the mapping for MM address 0xE
41
Direct mapped cache maps MM blocks in a modular
fashion to cache blocks. The mapping depends on:
• The number of bits in the main memory address
(how many addresses exist in main memory)
• The number of blocks in cache (which
determines the size of the block field)
• How many addresses (either bytes or words) are in
a block (which determines the size of the offset
field)
Direct mapped cache - review
To do for next time:
• Read pp. 342-3 of text and write a short
description of each type of ROM in
notebook!
• Read and understand examples 6.2 and 6.3
• Answer review questions 1- 10/390.
• Solve exercise 1/391
42
43
QUIZ: Types of ROM
• ROM
• PROM
• EPROM
• EEPROM
• Flash
Describe in one phrase
the difference between:
• ROM and PROM
• PROM and EPROM
• EPROM and EEPROM
• EEPROM and Flash
44
• ROM
• PROM
• EPROM
• EEPROM
• Flash
Source: http://en.wikipedia.org/wiki/Flash_memory
Flash memory was developed from EEPROM.
There are two main types of flash, which are named
after the NAND and NOR logic gates, b/c the
internal characteristics of the individual flash
memory cells exhibit characteristics similar to those
of the corresponding gates.
Whereas EEPROM has to be completely erased
before being rewritten:
--NAND type flash may be written and read in
blocks (or pages) which are generally much smaller
than the entire device;
--NOR type allows a single machine word (byte)
to be written or read independently.
45
EXAMPLE 6.2 Assume a byte-addressable main
memory consists of 2^14 bytes, cache has 16 blocks,
and each block has 8 bytes.
• The number of memory blocks: 2^14 / 8 = 2^11 = 2K blocks
• Each MM address requires 14 bits, of which:
– The rightmost 3 bits are the offset field
– We need log2(16) = 4 bits to select a specific block in
cache, so the block field consists of the middle 4 bits.
– The remaining 7 bits make up the tag field.
46
EXAMPLE 6.2 Assume a byte-addressable main
memory consists of 2^14 bytes, cache has 16 blocks,
and each block has 8 bytes.
• If brought into cache, in which cache block will the byte
at address 0x1234 reside?
• What will the block’s tag be?
• What is the offset of the byte inside the block?
To do for next time:
Read and understand examples 6.3 and
6.4.
47
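One way to check all three questions, as a sketch (assumes the field widths derived above: 7-bit tag, 4-bit block, 3-bit offset):

addr = 0x1234                    # 4660, a valid 14-bit address
offset = addr & 0b111            # byte within the block: 4
block  = (addr >> 3) & 0b1111    # cache block: 6
tag    = addr >> 7               # tag: 36 = 0b0100100
print(tag, block, offset)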
QUIZ: Exercise 1/391
48
54
Conclusion on Direct Mapped cache:
Finding a block in the cache does not require any
searching: The block is either at the calculated
location or it is not.
Problem: Sometimes a block (the “victim block”) has
to be removed from the cache although there are
other unused blocks in the cache!
55
Solution: The opposite extreme:
Finding a block in the cache requires searching the
entire cache!
56
• Instead of placing memory blocks in specific
cache locations based on memory address, we
allow a block to go anywhere in the cache!
• In this way, cache would have to fill up before
any blocks are evicted.
• A memory address is partitioned into only two
fields: the tag and the offset:
Fully Associative Cache
57
• We have 14-bit MM addresses and a cache with 16
blocks, each block of size 8. The field format of a
memory reference is:
• When the cache is searched, all tags are searched
in parallel to retrieve the data quickly.
• This requires special, costly hardware.
Example 6.2 – fully associative
58
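In software we can mimic the parallel tag search with a dictionary lookup; a rough sketch only (real hardware compares all tags at once, and the 16-entry capacity/eviction handling is omitted here):

OFFSET_BITS = 3                   # 8-byte blocks, so 11-bit tags remain
cache = {}                        # tag -> block contents

def fa_read(addr, mm):
    tag = addr >> OFFSET_BITS
    if tag not in cache:          # miss: fetch the whole block from MM
        cache[tag] = mm[tag]
    return cache[tag][addr & ((1 << OFFSET_BITS) - 1)]

mm = {t: ["byte %d:%d" % (t, i) for i in range(8)] for t in range(2**11)}
print(fa_read(0x1234, mm))        # block with tag 582, byte offset 4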
• Each MM block can reside in any cache block, so
the entire cache needs to be searched.
• This can be done with parallel (a.k.a.
associative) algorithms, but they don’t scale
well.
• Moreover …
Fully Associative Cache
Downsides
59
• You will recall that direct mapped cache evicts a
block whenever another memory reference
maps to the same cache location.
• With fully associative cache, we have no such
mapping, thus we must devise an algorithm to
determine which block to evict from the cache.
• The block that is evicted is the victim block.
• There are a number of ways to pick a victim,
a.k.a. cache replacement policies (stay tuned).
Fully Associative Cache
Downsides
60
• Combines the ideas of direct mapped cache and fully
associative cache.
• An N-way set-associative cache mapping is like
direct mapped cache in that a memory reference
maps to a particular location in cache.
• Unlike direct mapped cache, the location is a set of
several cache blocks, similar to the fully associative
cache.
Conclusion: Instead of mapping anywhere in the
cache, a memory reference can map only to a
(sub)set of cache slots. The (sub)set is determined
with modular arithmetic.
Set-Associative Cache
61
The number of cache blocks per set varies according to
overall system design. Example:
Set-Associative Cache
– A 2-way set associative cache can
be conceptualized as shown below.
– Each set contains two different
memory blocks.
[Figure: logical view vs. linear view of a 2-way set-associative cache]
62
• A memory reference is divided into three
fields: tag, set, and offset.
• As with direct-mapped cache, the offset field
chooses the word within the cache block, and
the tag field uniquely identifies the memory
address.
• The set field determines the set to which the
memory block maps.
Set-Associative Cache
63
We are using 2-way set associative mapping with a word-
addressable main memory of 2^14 words and a cache
with 16 blocks, where each block contains 8 words.
– Cache has a total of 16 blocks, and each set has 2 blocks,
so there are 8 sets in cache.
– Thus, the set field is 3 bits, the offset field is 3 bits, and
the tag field is 8 bits.
EXAMPLE 6.5
64
We are using 2-way set associative mapping with a word-
addressable main memory of 2^14 words and a cache
with 16 blocks, where each block contains 8 words.
– How does the cache controller look for MM address
0x1234 ?
EXAMPLE 6.5
65
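A sketch of the field split the controller applies here (8-bit tag, 3-bit set, 3-bit offset, per the previous slide; illustration only):

addr = 0x1234
offset  = addr & 0b111           # word within the block: 4
set_idx = (addr >> 3) & 0b111    # set index: 6
tag     = addr >> 6              # tag: 72 = 0b01001000
# The controller then compares 'tag' against both blocks of set 6.
print(tag, set_idx, offset)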
• With fully associative and set associative cache, a
replacement policy is invoked when it becomes
necessary to evict a block from cache.
• An optimal replacement policy would be able to look
into the future to see which blocks won’t be needed
for the longest period of time.
• Although it is impossible to implement an optimal
replacement algorithm, it is instructive to use it as a
benchmark for assessing the efficiency of any other
scheme we come up with.
6.4.2. Replacement policies
66
• With fully associative and set associative cache, a
replacement policy is invoked when it becomes
necessary to evict a block from cache.
6.4.2. Replacement policies
Why don’t we need a R.P.
for direct-mapped cache?
67
The replacement policy that we choose depends upon
the locality that we are trying to optimize
– usually, we are interested in temporal locality.
Least recently used (LRU) algorithm keeps track of
the last time that a block was accessed and evicts
the block that has been unused for the longest
period of time.
Disadvantage (complexity): LRU has to maintain access
history for each block → slows down the cache.
Replacement policies
68
First-in, first-out (FIFO) is a popular cache
replacement policy.
• The block that has been in the cache the longest is
evicted, regardless of when it was last used.
Random replacement policy
• Picks a block at random and replaces it with a new
block.
• It can evict a block that will be needed often or
needed soon, but it never thrashes.
Replacement policies
The probability of thrashing is very small!
70
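A toy contrast of LRU and FIFO victim selection (all structure names are made up; a sketch, not how real controllers are built):

from collections import OrderedDict, deque

def lru_access(cache, capacity, block):
    if block in cache:
        cache.move_to_end(block)            # refresh the access history
    else:
        if len(cache) == capacity:
            cache.popitem(last=False)       # evict the least recently used
        cache[block] = True

def fifo_access(cache, queue, capacity, block):
    if block not in cache:
        if len(cache) == capacity:
            cache.discard(queue.popleft())  # evict the oldest arrival
        cache.add(block)
        queue.append(block)

lru, fifo, order = OrderedDict(), set(), deque()
for b in [1, 2, 3, 1, 4]:
    lru_access(lru, 3, b)
    fifo_access(fifo, order, 3, b)
print(list(lru), sorted(fifo))   # LRU keeps block 1; FIFO has evicted it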
• The performance of hierarchical memory is
measured by its effective access time (EAT).
• EAT is a weighted average that takes into account
the hit ratio and relative access times of successive
levels of memory.
• The EAT for a two-level memory is given by:
EAT = H · Access_C + (1 − H) · Access_MM
where H is the cache hit rate and Access_C and Access_MM are
the access times for cache and main memory, respectively.
6.4.3 Cache performance
measures
71
A computer system has a MM access time of 200 ns
supported by a cache having a 10 ns access time
and a hit rate of 99%.
• Suppose access to cache and main memory
occurs concurrently. (The accesses overlap.)
• The EAT is:
0.99(10 ns) + 0.01(200 ns) = 9.9 ns + 2 ns = 11.9 ns.
Cache performance example
72
A computer system has a MM access time of 200 ns
supported by a cache having a 10 ns access time
and a hit rate of 99%.
• What if the accesses do not overlap?
• The EAT is:
0.99(10 ns) + 0.01(10 ns + 200 ns)
= 9.9 ns + 2.1 ns = 12 ns.
Cache performance example
73
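The two cases differ only in whether the miss term includes the cache probe; a quick check of the arithmetic above (a sketch, not from the text):

def eat(h, t_cache, t_mm, overlapped=True):
    miss_time = t_mm if overlapped else t_cache + t_mm
    return h * t_cache + (1 - h) * miss_time

print(eat(0.99, 10, 200))                    # ~11.9 ns, overlapped
print(eat(0.99, 10, 200, overlapped=False))  # ~12.0 ns, sequential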
A computer system has a MM access time of 100 ns
supported by a cache having a 6 ns access time.
The cache and MM accesses are non-overlapped.
We want the EAT to be 10 ns or under.
What is the minimum hit rate?
QUIZ: Cache performance
74
To do for next time:
• Read pp. 354-367 of text
• Read and understand examples 6.3, 6.4, 6.6
• Answer review questions 11- 17/390.
• Solve exercises 4 and 7/392
75
76
A computer system has a MM access time of 150 ns
and a hit rate of 98%.
The cache and MM accesses do not overlap.
We want the EAT to be 10 ns or under.
• What does EAT stand for?
• What is the slowest cache we can use for this
system (i.e. the max cache access time)?
QUIZ: Cache performance
77
• EAT = Effective Access Time
• 0.98∙Tmax + 0.02∙(Tmax + 150 ns) = 10 ns
Tmax = 7 ns
QUIZ: Cache performance
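Since both the hit and miss branches include Tmax, it factors out; the same algebra as a two-line check (sketch):

# 0.98*T + 0.02*(T + 150) <= 10  simplifies to  T + 0.02*150 <= 10
t_max = 10 - 0.02 * 150
print(t_max)   # 7.0 ns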
QUIZ cache algorithms:
Exercise 14/363
78
In each case, show what the cache controller will do when the CPU places a
request for each address in the sequence A, A, C, D, F, F
81
Least recently used (LRU) algorithm keeps track of
the last time that a block was accessed and evicts
the block that has been unused for the longest
period of time.
Disadvantage (complexity): LRU has to maintain
access history for each block → slows down the
cache.
p.333 of text: “There are ways to approximate LRU, but
that is beyond the scope of this book.”
Remember from last time:
Cache replacement policies
83
The LRU bit is the simplest implementation of the LRU
algorithm.
Example: We have a 2-way set-associative cache:
Approximating LRU with one bit
Used in the Intel Pentium,
see Section 6.6
Not in text, but required!
84
When there is a hit (or when the block is first brought into
the cache):
• its LRU bit is reset (made 0)
• the LRU bit of the other block in the set is set (made 1)
Approximating LRU with one bit
Not in text, but required!
85
The cache has 3 sets, and it is initially empty.
The CPU is requesting the following sequence of blocks:
12, 1, 2, 1, 2, 42, 1, 2, 3, 4
QUIZ: LRU bit
Not in text, but required!
86
12, 1, 2, 1, 2, 42, 1, 2, 3, 4
QUIZ: LRU bit
Not in text, but required!
87
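A sketch of the one-LRU-bit bookkeeping for this quiz (3 sets, 2-way, block b maps to set b mod 3; structure names made up):

sets = [{} for _ in range(3)]              # each set: {block: lru_bit}

def access(block):
    s = sets[block % 3]
    if block not in s:
        if len(s) == 2:                    # set full: evict the block whose bit is 1
            victim = next(b for b, bit in s.items() if bit == 1)
            del s[victim]
        s[block] = 0
    for b in s:                            # accessed block gets 0, the other gets 1
        s[b] = 0 if b == block else 1

for b in [12, 1, 2, 1, 2, 42, 1, 2, 3, 4]:
    access(b)
print(sets)   # e.g. set 0 ends with {42: 1, 3: 0}: block 12 was the victim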
• Caching performance depends upon programs
exhibiting good locality.
– Some object-oriented programs have poor locality
owing to their complex, dynamic structures.
– Arrays stored in column-major rather than row-major
order can be problematic for certain cache
organizations.
• With poor locality, caching can actually cause
performance degradation rather than improvement!
6.4.4 When does caching
break down?
88
• Cache replacement policies must take into account
dirty blocks = blocks that have been updated while
they were in the cache.
• Dirty blocks must be written back to MM. A write policy
determines how/when this is done.
• There are two types of write policies:
– write through → Both cache and MM are
updated simultaneously on every write
– write back → MM is updated only when the
block is selected for replacement
Writing to the Cache
89
• write through
– Disadvantage: MM must be updated with each cache
write, which slows down the access time on updates.
• This slowdown is usually negligible, b/c the
majority of accesses are reads, not writes.
– Advantage: The MM always stays consistent with the
cache
• write back
– Advantage: MM traffic is minimized
– Disadvantage: A data value in MM is not always the
same as that value in the cache
• This may cause problems in systems with
concurrent users, esp. when each user has their
own cache.
The cache coherence problem [Source: http://en.wikipedia.org/wiki/Cache_coherence]
90
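A toy sketch of the two policies (hypothetical names; a real controller moves whole blocks, not single words):

mm = {0: "old"}                  # pretend main memory
cache = {}                       # addr -> (value, dirty_bit)

def write(addr, value, write_through=True):
    if write_through:
        cache[addr] = (value, False)
        mm[addr] = value         # MM updated on every write
    else:
        cache[addr] = (value, True)   # write back: mark dirty, defer MM

def evict(addr):
    value, dirty = cache.pop(addr)
    if dirty:                    # written back only at replacement time
        mm[addr] = value

write(0, "new", write_through=False)
print(mm[0])   # still "old": MM is stale (the coherence problem)
evict(0)
print(mm[0])   # "new" once the dirty block is written back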
Not in text
91
• The cache we have been discussing is called a
unified or integrated cache where both instructions
and data are cached.
• Many modern systems employ separate caches
for data and instructions.
– This is called a Harvard cache.
What to Cache?
92
Most programs:
– have higher instruction locality than data locality
– are not self-modifying (!)
It makes sense, then, to use separate caches,
each with its own size, mapping alg.,
replacement alg., write policy
Downside: greater complexity
– A larger unified cache provides about the same
performance improvement w/o introducing as
much complexity.
Why use separate caches?
93
• Cache performance can also be improved by adding
a small associative cache to hold blocks that have
been evicted recently.
– This is called a victim cache.
• A trace cache is a variant of the instruction cache:
– It holds decoded instructions for program branches, giving
the illusion that noncontiguous instructions are really
contiguous.
– Example: loops
– The Intel Pentium 4 used it!
What to Cache?
Remember from Chs.4, 5:
• microoperations
• pipelining
• branch prediction
95
Trace Caches in Intel CPUs
http://www.anandtech.com/print/604
P6 µarch, a.k.a. i686 (Pentium Pro, Pentium II, III, Celeron, M, Xeon) had
separate L1 caches and L2, but no trace cache.
NetBurst µarch, a.k.a. P7 (i786?) (Pentium 4, D, Xeon)
The L1 instruction cache becomes a trace cache of size 12K micro-ops,
called Execution Trace Cache.
Not in text, FYI only
Patented by Alex Peleg
and Uri Weiser of Intel
Corp. in 1994.
Commercially available
in 2000-2006, starting
with the Willamette
core for the Pentium 4.
96 Source: http://www.xbitlabs.com/articles/cpu/display/nehalem-microarchitecture_3.html (for example discussion)
Not in text, FYI only
Core µarch (Core 2, Xeon)
In addition to a classical L1 instruction cache, there is a trace
cache called Loop Stream Detector, which can hold 18
instructions. Detects only loops.
Nehalem µarch (Core i3, i5, i7)
The Loop Stream Detector is moved downstream of the
decoder, can hold 28 micro-ops. Still only loops.
97
Trace Cache in Intel CPUs
Source: http://www.behardware.com/articles/815-2/intel-core-i7-and-core-i5-lga-1155-sandy-bridge.html
Not in text, FYI only
Sandy Bridge µarch (second gen. Core i3, i5, i7)
Aside from the classical L1 instruction cache, there is an “L0” trace
cache holding 1.5K µops. Can store any branches, not just loops.
Most of today’s desktops and servers employ multilevel
cache hierarchies:
• L1 cache (8KB to 64KB) is situated on the processor
itself.
• L2 cache (64KB to 2MB) was initially on the
motherboard (part of the “chipset”) or on an
expansion card.
Multi-level Caches
Figure source:
http://www.karbosguide.com/hardware/module3b2.htm
99
Multi-level Caches
Figure source: http://www.elektronik-kompendium.de/sites/com/0309291.htm
Figure source: http://www.tomshardware.co.uk/athlon-l3-cache,review-31697-2.html
The trend today is for the L2 cache to
“migrate” onto the CPU chip, thereby having
two levels of cache on the chip itself.
The old, external cache was renamed L3 cache.
100
Once the number of cache levels is determined, the
next thing to consider is whether data (or
instructions) can exist in more than one cache level.
• Inclusive caches: the same data/instr. may be
present at multiple levels of cache.
– Strictly inclusive: all data/instr. in a smaller
cache (e.g. L1) must be present at the next
higher level (e.g. L2).
• Exclusive cache: permit only one copy of the data
(e.g. either in L1 or L2, not in both).
Multi-level Caches
101
6.5 Virtual Memory
• Cache memory enhances performance by providing
faster memory access speed.
• Virtual memory (VM) enhances performance by
providing greater memory capacity w/o adding more
MM.
• Instead, a portion of a disk drive serves as an
extension of MM.
• If a system uses paging, VM partitions the MM into
individually managed page frames that are written
(or paged out) to disk when they are not immediately
needed.
Similar to the cache
blocks
102
SKIP the rest of
6.5 Virtual Memory
Until …
103
• Another approach to virtual memory is the use of
segmentation.
• Instead of dividing memory into equal-sized pages,
virtual address space is divided into variable-length
segments, often under the control of the programmer.
• A segment is located through its entry in a segment
table, which contains the segment’s memory location
and a bounds limit that indicates its size.
• After a segment fault, the operating system searches for a
location in memory large enough to hold the segment
that is retrieved from disk.
6.5.5 Segmentation
104
Both paging and segmentation can cause
fragmentation:
• Paging is subject to internal fragmentation because
a process may not need the entire range of addresses
contained within the page. Thus, there may be many
pages containing unused fragments of memory.
• Segmentation is subject to external fragmentation,
which occurs when contiguous chunks of memory
become broken up as segments are allocated and
deallocated over time.
Segmentation
105
• Consider a small
computer having 32K of
memory, with
segmentation.
• The memory segments of
two processes are shown
in the table at the right.
• The segments can be
allocated anywhere in
memory.
Segmentation example
106
• All of the segments of P1 and one of
the segments of P2 are loaded as
shown at the right.
• Segment S2 of process P2 requires
11K of memory, and there is only 1K
free, so it waits.
Segmentation example
107
• Eventually, Segment 2 of Process 1
is no longer needed, so it is
unloaded, giving 11K of free memory.
• But Segment 2 of Process 2 cannot
be loaded because the free memory
is not contiguous.
Segmentation example
108
• Over time, the problem gets
worse, resulting in small
unusable blocks scattered
throughout physical memory.
• This is an example of external
fragmentation.
• Eventually, this memory is
recovered through compaction,
a.k.a. garbage collection and
the process starts over.
Segmentation example
109
• Large page tables are cumbersome and slow, but with
its uniform memory mapping, page operations are
fast. Segmentation allows fast access to the segment
table, but segment loading is labor-intensive.
• Paging and segmentation can be combined to take
advantage of the best features of both by assigning
fixed-size pages within variable-sized segments.
• Each segment has a page table. This means that a
memory address will have three fields:
– one for the segment
– a second one for the page (within the segment)
– a third for the offset (within the page)
6.5.6 Paging and segmentation
110
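A sketch of the three-field split with made-up widths (4-bit segment, 6-bit page, 10-bit offset; real widths vary by system):

SEG_BITS, PAGE_BITS, OFF_BITS = 4, 6, 10

def split_virtual(addr):
    offset  = addr & ((1 << OFF_BITS) - 1)
    page    = (addr >> OFF_BITS) & ((1 << PAGE_BITS) - 1)
    segment = addr >> (OFF_BITS + PAGE_BITS)
    return segment, page, offset   # each segment owns its own page table

print(split_virtual(0x12345))      # (1, 8, 837)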
6.6 Real-World Example
Early Intel Caches
As the x86 microprocessors reached clock rates of 20 MHz and above in
the Intel 386, small amounts of fast cache memory began to be
featured in systems to improve performance. This was because the
DRAM used for main memory had significant latency, up to 120 ns,
as well as refresh cycles.
• The cache was constructed from more expensive, but significantly
faster, SRAM, which at the time had latencies around 10 ns.
• The early caches were external to the processor and typically located
on the motherboard in the form of eight or nine DIP devices placed in
sockets to enable the cache as an optional extra or upgrade feature.
• Some versions could support 16 to 64 KB of external cache.
Source: http://en.wikipedia.org/wiki/CPU_cache#In_x86_microprocessors
Not in text
1985
111
6.6 Real-World Example
Early Intel Caches
With the Intel 486, an 8 KB cache was integrated directly into the CPU
die.
• This cache was termed Level 1 or L1 cache to differentiate it from the
slower on-motherboard, or Level 2 (L2) cache.
• These on-motherboard caches were much larger, with the most
common size then being 256 KB.
Source: http://en.wikipedia.org/wiki/CPU_cache#In_x86_microprocessors
Not in text
1989
112
6.6 Real-World Example
Pentium (P5) Caches
The P5 micro-architecture supports both paging and segmentation,
which can be used in various combinations (unpaged unsegmented,
segmented unpaged, unsegmented paged).
Two levels of cache (L1 and L2), both having block size of 32 B, and
using 2-way set-associative mapping.
• L1 cache:
– On the CPU chip
– Has two parts: instruction cache (I-cache) and a data cache (D-
cache).
• L2:
– External
– 512 KB to 1 MB
1993
113
6.6 Real-World Example
Pentium (P5) Caches
114
Not in text
Today’s processors have L1, L2, and L3
built on the die (chip) itself!
Source: http://www.extremetech.com/computing/77348-amd-unveils-barcelona-quadcore-details
115
Not in text
Latest development: Package eDRAM chip next to CPU!
Sources: http://forum.hwbot.org/showthread.php?t=73204&page=2
http://semimd.com/chipworks/2014/02/07/intels-e-dram-shows-up-in-the-wild/
http://www.cinemablend.com/games/Wii-U-Memory-Bandwidth-GPU-More-Powerful-Than-We-Thought-62437.html
116
Intel’s latest
micro-
architecture:
Haswell
Not in text
• Micro-operation cache capable of
storing 1.5 K micro-ops (~6 KB)
produced by the decoders
• 14- to 19-stage instruction
pipeline, depending on the micro-
op cache hit or miss
Sources: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
http://wccftech.com/article/exploring-amds-intels-architectural-philosophies-future-hold-part/
117
• Memory hierarchy → smallest, fastest memory at
the top, largest, slowest memory at the bottom.
• Cache memory gives faster access to MM
• VM uses disk storage to give the illusion of having a
large MM.
• Cache maps blocks of main memory to blocks of
cache memory.
• VM maps virtual pages to page frames.
• There are three general cache mapping schemes: direct
mapped, fully associative, and set associative.
Chapter 6 REVIEW
118
• With fully associative and set associative cache,
replacement policies must be established. E.g.,
LRU, FIFO, or LFU.
• Write policies → what to do with “dirty” (updated)
blocks.
• Segmentation → segments are variable-sized
units assigned to processes.
• Fragmentation:
– internal for paged memory
– external for segmented memory.
Chapter 6 REVIEW
119
Exercises 3, 5, 10
Due Tuesday lab, Dec.2 before review
Chapter 6 HOMEWORK
120
A byte-addressable memory uses 15 bits for addresses.
The cache has 64 blocks, of size 8 Byte each.
Show how the address is to be split into fields for a
direct-mapped cache.
Example 6.4
121
A byte-addressable memory uses 15 bits for addresses.
The cache has 64 blocks, of size 8 Byte each.
Explain what the cache controller does when the MM
address ABC7 is needed by the CPU.
Example 6.4
122
A byte-addressable memory uses 15 bits for addresses.
The cache has 64 blocks, of size 8 Byte each.
Is the figure on p.328 of the text wrong?
Address 1028 → 000010 000000 100
Example 6.4
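A quick check (sketch): with 64 blocks the block field is 6 bits, 8-byte blocks give a 3-bit offset, and a 15-bit address leaves a 6-bit tag.

addr = 1028                        # 000010 000000 100 in 15 bits
offset = addr & 0b111              # 100 -> 4
block  = (addr >> 3) & 0b111111    # 000000 -> 0
tag    = addr >> 9                 # 000010 -> 2
print(tag, block, offset)          # matches the split above: the figure is right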