CMageshKumar_AP_AIHT CS2071_Computer Architecture
ANAND INSTITUTE OF HIGHER TECHNOLOGY Chennai-603 103 DEPARTMENT OF ELECTRONICS AND INSTRUMENTATION ENGINEERING
CS2071 COMPUTER ARCHITECTURE
Faculty: C. MAGESHKUMAR   Class: IV EIE A&B   Semester: VII
UNIT IV – MEMORY SYSTEM
CONTENT
I. Review of digital design
1. Signals, logic operators and gates
2. Gates as control elements
3. Combinational circuits
4. Programmable combinational parts
5. Sequential circuits
II. Main memory concepts
1. Memory – definition
2. Memory hierarchy
3. Memory performance parameters
4. Memory structure and memory cycle, memory chip organization
5. Hitting the memory wall, pipelined memory and interleaved memory
III. Types of memory
1. Types
2. Static RAM
3. Dynamic RAM
4. Other types
IV. Cache memory organization
1. Cache memory & need for cache
2. Basic cache terms, design parameters of cache memory
3. What makes the cache work?
4. Cache organization (mapping)
5. Cache performance measure
6. Cache and main memory
7. Cache coherency
V. Secondary storage (mass memory concepts)
VI. Virtual memory and paging
REVIEW OF DIGITAL DESIGN
1. SIGNALS, LOGIC OPERATORS AND GATES:
Signals:
All information elements in digital computers, including instructions, numbers, and symbols, are encoded as electronic signals that are almost always two-valued (binary).
Binary signals can be represented by the presence or absence of some electrical property such as voltage, current, field, or charge.
Signals are of two types:
1. Analog signal (continuous signal)
2. Digital signal (binary signal)
Circuits:
Combinational digital circuits (memoryless circuits), e.g., multiplexers, decoders, encoders
Sequential digital circuits (circuits with memory), e.g., latches, flip-flops, registers
Logic operators:
Variations in Gate Symbols
Gates with more than two inputs and/or with inverted signals at input or output.
2. GATES AS CONTROL ELEMENTS:
Tristate buffer:
A buffer whose output equals the data input when the control signal e is asserted, and which assumes a high-impedance state when e is de-asserted.
Used to effectively isolate the output from the input.
An AND gate and a tristate buffer act as controlled switches or valves. An inverting buffer is logically the same as a NOT gate.
Common names for the two signal values:
0: off, false, low, negative
1: on, true, high, positive
The basic logic operators (table reconstructed from the garbled figure):
Gate | Operator sign        | Arithmetic expression | Output is 1 iff
NOT  | x' (or x with a bar) | 1 - x                 | input is 0
AND  | x AND y, or xy       | xy                    | both inputs are 1
OR   | x OR y               | x + y - xy            | at least one input is 1
XOR  | x XOR y              | x + y - 2xy           | inputs are not equal
Each of AND, OR, and XOR also has an inverted-output counterpart: NAND, NOR, and XNOR.
Wired OR and Bus Connections:
Wired OR allows tying together of several controlled signals.
Control/Data Signals and Signal Bundles:
Arrays of logic gates represented
by a single gate symbol.
Designing Gate Networks
AND-OR, NAND-NAND, OR-AND, NOR-NOR
Logic optimization: cost, speed, power dissipation
A two-level AND-OR circuit and two equivalent circuits are shown below
BCD-to-Seven-Segment Decoder:
The logic circuit that generates the enable signal for the lowermost segment (number 3) in a seven-
segment display unit.
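The decoder's truth table can be sketched as a lookup. The segment numbering below is an assumption (the notes only state that segment 3 is the lowermost one), and whether the digit 9 lights the bottom bar depends on the display style.

```python
# Behavioral sketch of a BCD-to-seven-segment decoder as a lookup table.
# Assumed segment numbering: 0=top, 1=upper-left, 2=upper-right,
# 3=bottom, 4=lower-left, 5=lower-right, 6=middle.

# For each BCD digit 0-9, the set of segments to enable.
SEGMENTS = {
    0: {0, 1, 2, 3, 4, 5},        # no middle bar
    1: {2, 5},
    2: {0, 2, 6, 4, 3},
    3: {0, 2, 6, 5, 3},
    4: {1, 6, 2, 5},
    5: {0, 1, 6, 5, 3},
    6: {0, 1, 6, 4, 5, 3},
    7: {0, 2, 5},
    8: {0, 1, 2, 3, 4, 5, 6},
    9: {0, 1, 2, 6, 5, 3},        # a '9' without the bottom tail would omit 3
}

def enable_signals(bcd):
    """Return (e0, ..., e6) for a 4-bit input in [0, 9]."""
    if not 0 <= bcd <= 9:
        raise ValueError("input must be a BCD digit")
    lit = SEGMENTS[bcd]
    return tuple(1 if seg in lit else 0 for seg in range(7))

# Digits whose lowermost segment (number 3) is enabled:
e3_digits = [d for d in range(10) if enable_signals(d)[3] == 1]
```

A hardware implementation would realize e3 as a two-level sum-of-products over the four input bits; the lookup table above is the specification it must match.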
[Figure: (a) AND gate for controlled transfer; (b) tristate buffer; (c) model for AND switch; (d) model for tristate buffer. The enable/pass signal e gates data x to the output, which is 0 (AND gate) or high impedance (tristate buffer) when e is de-asserted.]
[Figure: (a) wired OR of product terms, data out is x, y, z, or 0; (b) wired OR of tristate outputs, data out is x, y, z, or high impedance.]
[Figure: signal bundles: (a) 8 NOR gates; (b) 32 AND gates; (c) k XOR gates, each array drawn as a single gate symbol with slash-annotated bus widths.]
[Figure: (a) AND-OR circuit; (b) intermediate circuit; (c) NAND-NAND equivalent.]
[Figure: BCD-to-seven-segment decoder: a 4-bit input in [0, 9] generates enable signals e0 to e6 for the display segments.]
3. COMBINATIONAL (MEMORYLESS) CIRCUITS: (COMBINATIONAL PARTS)
High-level building blocks
Much like prefab parts used in building a house
Arithmetic components (adders, multipliers, ALUs)
Examples of combinational parts:
multiplexers,
decoders/demultiplexers,
encoders
Multiplexer (MUX): (many to one)
A multiplexer (mux), or selector, allows one of several inputs to be selected and routed to the output, depending on the binary value of a set of selection or address signals provided to it.
Decoders/Demultiplexers
A decoder allows the selection of one of 2^a options using an a-bit address as input. A demultiplexer (demux) is a decoder that only selects an output if its enable signal is asserted.
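The behavior of these parts can be sketched in a few lines (a behavioral model, not a gate-level design; the function names are illustrative):

```python
# Behavioral sketch of a multiplexer and a decoder/demultiplexer.

def mux(inputs, select):
    """2^a-to-1 multiplexer: route the selected input to the output."""
    return inputs[select]

def decoder(address, a, enable=True):
    """a-bit decoder: assert exactly one of 2^a outputs (none if disabled).
    With the enable input, it doubles as a demultiplexer."""
    return [1 if enable and i == address else 0 for i in range(2 ** a)]

# 4-to-1 mux: selection value 2 routes input x2 to the output.
assert mux([0, 1, 1, 0], 2) == 1
# 2-to-4 decoder: address 3 asserts only output line 3.
assert decoder(3, 2) == [0, 0, 0, 1]
assert decoder(3, 2, enable=False) == [0, 0, 0, 0]
```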
4. PROGRAMMABLE COMBINATIONAL PARTS
A programmable combinational part can do the job of many gates or gate networks, avoiding the need for a large number of small-scale integrated circuits when implementing a Boolean function of several variables.
Programmed by cutting existing connections (fuses) or establishing new connections (antifuses).
Programmable ROM (PROM)
Programmable connections and their use in a PROM are shown below.
Programmable array logic (PAL): the OR array has fixed connections, but the inputs to the AND gates can be programmed.
Programmable logic array (PLA): both the AND and OR arrays are programmed.
The general structure of programmable combinational logic and the two classes known as PAL and PLA devices are shown below. (Not shown is the PROM, which has a fixed AND array, i.e., a decoder, and a programmable OR array.)
[Figure: (a) 2-to-1 mux; (b) switch view; (c) mux symbol; (d) mux array; (e) 4-to-1 mux with enable; (f) 4-to-1 mux design.]
[Figure: (a) 2-to-4 decoder; (b) decoder symbol; (c) demultiplexer, or decoder with enable.]
[Figure: (a) programmable OR gates; (b) logic equivalent of part a; (c) programmable read-only memory (PROM), with a decoder driving the programmable OR array.]
Timing and Circuit Considerations:
Changes in a gate or circuit output, triggered by changes in its inputs, are not instantaneous.
Gate delay (δ): a fraction of a nanosecond; the time taken by the gate to produce its output after the input is applied.
Wire delay, previously negligible, is now important (electronic signals travel about 15 cm per ns).
Circuit simulation is used to verify function and timing.
CMOS Transmission Gates:
A CMOS transmission gate
and its use in building a 2-to-1 mux.
5. SEQUENTIAL CIRCUITS (WITH MEMORY) (NOTE: Please refer to pages 28-34, Chapter 2 in B. Parhami, "Computer Architecture", for a detailed description.)
A programmable sequential part contains gates and memory elements.
Programmed by cutting existing connections (fuses) or establishing new connections (antifuses).
The design of sequential circuits exhibiting memory requires storage elements capable of holding information (a single bit) that can be set to '1' or reset to '0'.
Programmable array logic (PAL)
Field-programmable gate array (FPGA)
Both types contain macrocells and interconnects.
Latches, Flip-Flops, and Registers
[Figure: programmable combinational logic: (a) general structure, with an AND array (AND plane) feeding an OR array (OR plane); (b) PAL: programmable AND array, fixed OR array, 8-input ANDs; (c) PLA: programmable AND and OR arrays, 6-input ANDs and 4-input ORs.]
[Figure: (a) CMOS transmission gate: circuit and symbol; (b) two-input mux built of two transmission gates.]
[Figure: (a) SR latch; (b) D latch; (c) master-slave D flip-flop; (d) D flip-flop symbol; (e) k-bit register.]
Sequential Machine Implementation (Hardware realization of Moore and Mealy sequential machines)
DESIGNING SEQUENTIAL CIRCUITS:
Useful Sequential Parts:
High-level building blocks
Much like prefab closets used in building a house
Other memory components, covered later: SRAM details, DRAM, flash.
Here we cover three useful parts:
shift register, register file (SRAM basics), counter
(NOTE: Please refer to Chapters 1 and 2 in B. Parhami, "Computer Architecture", for detailed descriptions.)
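A register file with two read ports and one write port can be sketched behaviorally (the port names follow the usual description; the class itself is an illustrative model, not a hardware design):

```python
# Behavioral sketch of a register file: 2^h registers of k bits each,
# two read ports and one write port gated by Write enable.

class RegisterFile:
    def __init__(self, h, k):
        self.k = k
        self.regs = [0] * (2 ** h)   # 2^h k-bit registers, initially 0

    def read(self, addr0, addr1):
        """Two read ports: return (read data 0, read data 1)."""
        return self.regs[addr0], self.regs[addr1]

    def write(self, addr, data, write_enable=True):
        """One write port; the write happens only if Write enable is set."""
        if write_enable:
            self.regs[addr] = data & ((1 << self.k) - 1)  # keep k bits

rf = RegisterFile(h=3, k=8)      # 8 registers, 8 bits wide
rf.write(5, 0x1FF)               # a 9-bit value is truncated to 8 bits
assert rf.read(5, 0) == (0xFF, 0)
```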
[Figure: hardware realization of Moore and Mealy sequential machines: inputs and the present state feed next-state logic, whose next-state excitation signals drive the state register; output logic produces the outputs (fed by the inputs as well only in a Mealy machine).]
[Figure: shift register built from D flip-flops FF0-FF2 with a shared clock and enable.]
[Figure: (a) register file with random access: a decoder plus write enable select one of 2^h k-bit registers for writing, and muxes select two registers for reading; (b) graphic symbol for the register file, with write data, write address, two read addresses, read/write enables, and two read data outputs; (c) FIFO symbol, with push/pop controls and full/empty flags.]
II. MAIN MEMORY CONCEPTS:
1. MEMORY – DEFINITION
Memory:
Memory refers to a physical device used to store programs or data, on a temporary or permanent basis, for use in a computer or other electronic device.
Memory cell:
A memory cell is capable of storing one bit of information. Cells are usually organized in the form of an array.
Components of the Memory System
• Main memory: fast, random access, expensive, located close to (but not inside) the CPU, and used to store the program and data currently manipulated by the CPU.
• Secondary memory: slow, cheap, direct access, located remotely from the CPU.
2. MEMORY HIERARCHY
The Need for a Memory Hierarchy:
To match memory speed with processor speed
o Memory holding the program must be accessible in nanoseconds or less.
The widening speed gap between CPU and main memory
o Processor operations take on the order of 1 ns.
o A memory access requires 10s or even 100s of ns.
Memory bandwidth limits the instruction execution rate
o Each instruction executed involves at least one memory access; hence, a few to 100s of MIPS is the best that can be achieved.
o A fast buffer memory can help bridge the CPU-memory gap.
o The fastest memories are expensive and thus not very large.
Problems with the Memory System
What do we need?
We need memory that fits very large programs and works at a speed comparable to that of the microprocessors.
Main problem:
- microprocessors work at a very high rate and need large memories;
- memories are much slower than microprocessors.
Facts:
- the larger a memory, the slower it is;
- the faster the memory, the greater the cost/bit.
A Solution:
It is possible to build a composite memory system which combines a small, fast memory and a large, slow main memory, and which behaves (most of the time) like a large, fast memory.
The two-level principle above can be extended into a hierarchy of many levels, including the secondary memory (disk store).
The effectiveness of such a memory hierarchy is based on a property of programs called the principle of locality.
Some typical characteristics:
1. Processor registers:
- 32 registers of 32 bits each = 128 bytes
- access time = few nanoseconds
2. On-chip cache memory:
- capacity = 8 to 32 Kbytes
- access time = ~10 nanoseconds
3. Off-chip cache memory:
- capacity = few hundred Kbytes
- access time = tens of nanoseconds
4. Main memory:
- capacity = tens of Mbytes
- access time = ~100 nanoseconds
5. Hard disk:
- capacity = few Gbytes
- access time = tens of milliseconds
The key to the success of a memory hierarchy is that data and instructions can be distributed across the memory so that, most of the time, they are available on the top levels of the hierarchy when needed.
• The data held in the registers is under the direct control of the compiler or of the assembly programmer.
• The contents of the other levels of the hierarchy are managed automatically:
- migration of data/instructions to and from caches is performed under hardware control;
- migration between main memory and backup store is controlled by the operating system (with hardware support).
3. MEMORY PERFORMANCE PARAMETERS(Refer page no. 167 & 168 in Xerox)
Access methods: sequential access and random access
Performance: Access time, memory cycle time, transfer rate
4. MEMORY STRUCTURE AND MEMORY CYCLE, MEMORY CHIP ORGANIZATION
(With addition to the below notes & pictures, also refer page number 175 to 191 in Xerox)
SRAM:
Basically a large array of storage cells that are accessed like registers.
An SRAM memory cell requires 4-6 transistors per bit.
SRAM holds the stored data as long as it is powered on.
These storage cells are edge-triggered D flip-flops.
Limitations of flip-flops:
o They add complexity to cells.
o Only fewer cells can be mounted on a chip.
So latches are used instead of flip-flops, but they take more time to read/write.
Memory Structure and SRAM (page no. 317 in B. Parhami)
The conceptual inner structure of a 2^h × g SRAM chip and its shorthand representation are shown below.
SRAM with Bidirectional Data Bus
When data input and output of an SRAM chip are shared or connected to a bidirectional data bus, output
must be disabled during write operations.
[Figure: conceptual inner structure of a 2^h × g SRAM chip: an h-bit address decoder selects one of the 2^h rows of g-bit storage cells (D flip-flops); control inputs are Write enable (WE), Chip select (CS), and Output enable (OE). The shorthand symbol shows Addr, Data in, Data out, WE, CS, and OE.]
[Figure: SRAM with a shared, bidirectional data bus: Data in and Data out are tied together, so the output must be disabled (via Output enable) during write operations.]
Multiple-Chip SRAM
Eight 128K × 8 SRAM chips forming a 256K × 32 memory unit are shown below.
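The address decoding this implies can be sketched as follows, assuming (consistent with the figure) that the MSB of the 18-bit word address drives chip select while the remaining 17 bits go to every chip:

```python
# Sketch of how an 18-bit word address maps onto eight 128K x 8 chips
# forming a 256K x 32 unit (two ranks of four chips; the wiring here is
# an assumption, not spelled out in the text).

def chips_for_access(addr18):
    """Return (rank, chip_address) for an 18-bit word address.
    The four chips of the selected rank each supply one byte (bytes 0-3)
    of the 32-bit word."""
    assert 0 <= addr18 < 2 ** 18           # 256K words
    rank = addr18 >> 17                    # MSB selects one of two ranks
    chip_address = addr18 & (2 ** 17 - 1)  # 17 bits go to every chip
    return rank, chip_address

assert chips_for_access(0) == (0, 0)
assert chips_for_access(2 ** 17) == (1, 0)   # first word of the second rank
```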
DRAM and Refresh Cycles
DRAM:
Stores data as electric charge on a tiny capacitor that is accessed through a MOS pass transistor.
When the word line is asserted,
o to write:
a low voltage on the bit line discharges the capacitor, i.e., bit = '0';
a high voltage on the bit line charges the capacitor, i.e., bit = '1'.
o to read:
the read operation takes 2 steps:
step 1: the row is accessed;
step 2: the column is selected.
The bit line is first precharged to a halfway voltage and then sensed by a sense amplifier.
The read operation destroys the content, so a write operation follows each read. This is also called destructive readout.
The single-transistor DRAM cell, which is considerably simpler than the SRAM cell, leads to dense, high-capacity DRAM memory chips.
[Figure: eight 128K × 8 SRAM chips forming a 256K × 32 memory unit: the MSB of the 18-bit address selects one of two ranks of four chips (via chip select), the remaining 17 bits go to every chip, and the four chips of the selected rank supply bytes 0-3 of the 32-bit data word.]
[Figure: (a) DRAM cell: one pass transistor and a capacitor between the word line and the bit line; (b) typical SRAM cell: cross-coupled inverters connected to the bit line and its complement via pass transistors, powered by Vcc.]
DRAM Refresh Cycles and Refresh Rate:
o The voltage across a DRAM cell capacitor varies after a 1 is written and with each subsequent refresh operation.
o Leakage of charge from the tiny capacitor causes the data to be erased after a fraction of a second, so DRAM must be refreshed periodically.
o Refreshing: a write operation is performed before the capacitor charge decays past the threshold voltage.
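The cost of refresh can be estimated with a quick calculation (all parameter values below are illustrative assumptions, not from the text):

```python
# Rough sketch of DRAM refresh overhead.

rows = 8192                 # rows to refresh in each refresh period (assumed)
refresh_period = 64e-3      # seconds: every row refreshed within ~64 ms (assumed)
row_refresh_time = 100e-9   # seconds to refresh one row (assumed)

busy = rows * row_refresh_time      # time spent refreshing per period
overhead = busy / refresh_period    # fraction of time the memory is busy refreshing
# With these numbers, roughly 1.3% of memory bandwidth is lost to refresh.
```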
DRAM Packaging:
24-pin dual in-line package (DIP): a typical DRAM package housing a 16M × 4 memory.
MEMORY CYCLE: (Refer page numbers 182, 183 in Xerox)
5. HITTING THE MEMORY WALL, PIPELINED MEMORY AND INTERLEAVED MEMORY
Hitting the Memory Wall:
Memory density and capacity have grown along with CPU power and complexity, but memory speed has not kept pace.
Bridging the CPU-Memory Speed Gap
Two ways of using a wide-access memory to bridge the speed gap between the processor and memory are shown below.
Idea: retrieve more data from memory with each access.
[Figure: voltage across a DRAM cell capacitor after a 1 is written, with refresh operations restoring the "voltage for 1" level before it decays past the threshold voltage toward "voltage for 0"; a cell has 10s of ms before needing a refresh cycle.]
[Figure: 24-pin DIP pinout of a 16M × 4 DRAM. Legend: Ai = address bit i; CAS = column address strobe; Dj = data bit j; NC = no connection; OE = output enable; RAS = row address strobe; WE = write enable.]
[Figure: relative performance of processor and memory, 1980-2010: processor performance grows far faster than memory performance, producing the memory wall.]
[Figure: (a) buffer and multiplexer at the memory side: wide-access memory with a narrow bus to the processor; (b) buffer and multiplexer at the processor side: wide-access memory with a wide bus to the processor.]
PIPELINED MEMORY AND INTERLEAVED MEMORY:
(Refer page no. 325 in text book B.Parhami)
Memory latency may involve other supporting operations besides the physical access itself:
o virtual-to-physical address translation;
o tag comparison to determine cache hit/miss.
A pipelined cache memory is shown below.
Memory Interleaving:
o Interleaved memory is more flexible than wide-access memory in that it can handle multiple independent accesses at once.
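Low-order interleaving can be sketched as follows, dispatching each access by the two LSBs of its address (a behavioral sketch; a real dispatcher is hardware):

```python
# Sketch of four-way low-order interleaving: the two LSBs of an address
# pick the module, so consecutive addresses fall in different modules and
# their (slow) memory cycles can overlap in time.

def dispatch(addr, modules=4):
    """Return (module number, address within module)."""
    return addr % modules, addr // modules

# Consecutive addresses hit modules 0, 1, 2, 3, 0, ... so a burst of four
# accesses can be in flight at once.
assert [dispatch(a)[0] for a in range(6)] == [0, 1, 2, 3, 0, 1]
```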
III. TYPES OF MEMORY (Refer page no. 171 to 175, 193 to 196, 200 to 209 in Xerox)
1. TYPES
2. Static RAM
3. Dynamic RAM
4. Other types
IV. CACHE MEMORY ORGANIZATION
1. CACHE MEMORY & NEED FOR CACHE:
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from frequently used main memory locations.
As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
A cache memory is a small, very fast memory that retains copies of recently used information from main memory. It operates transparently to the programmer, automatically deciding which values to keep and which to overwrite.
The processor operates at its high clock rate only when the memory items it requires are held in the cache. The overall system performance depends strongly on the proportion of memory accesses that can be satisfied by the cache.
[Figure: pipelined cache memory: the stages are address translation; row decoding and readout; column decoding and selection; and tag comparison and validation.]
[Figure: four-way interleaved memory: a dispatch unit, based on the 2 LSBs of the address, routes requests to four modules handling addresses that are 0, 1, 2, and 3 mod 4; each module's memory cycle spans several bus cycles, so accesses to different modules overlap in time.]
Cache space (~KBytes) is much smaller than main memory (~MBytes), so items have to be placed in the cache so that they are available there when (and possibly only when) they are needed.
As memory size increases, cost per bit decreases, but speed decreases too, i.e., memory access time increases.
Processor speed ≠ memory speed
There is a huge gap between processor speed and memory speed, and it widens as processor performance improves.
Cache memories act as intermediaries between the superfast processor and the much slower main memory.
Multiple caches
2. BASIC CACHE TERMS, DESIGN PARAMETERS OF CACHE MEMORY
Cache hit: an access to an item which is in the cache, i.e., the required data is found in the cache.
Cache miss: an access to an item which is not in the cache, i.e., the required data is not found in the cache.
Hit rate: the proportion of all memory accesses that are satisfied by the cache, i.e., the fraction of data accesses that can be satisfied from the cache as opposed to the slower main memory.
Miss rate: the proportion of all memory accesses that are not satisfied by the cache.
The miss rate of a well-designed cache: a few percent.
Cfast – cache memory access cycle
Cslow – slower memory (main memory) access cycle
Ceff – effective memory cycle time
For one level of cache with hit rate h:
Ceff = h·Cfast + (1 – h)(Cslow + Cfast)
Ceff = Cfast + (1 – h)·Cslow
Ceff = Cfast (when the hit rate h = 1)
Ceff = Cfast creates the illusion that the entire memory space consists of fast (cache) memory.
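The formula above can be checked numerically (the hit rate and cycle times below are illustrative assumptions):

```python
# The effective memory cycle time formula, as a quick calculation.

def c_eff(h, c_fast, c_slow):
    """Ceff = Cfast + (1 - h) * Cslow for a single level of cache."""
    return c_fast + (1 - h) * c_slow

# A 95% hit rate with a 2 ns cache and 100 ns main memory:
assert abs(c_eff(0.95, 2, 100) - 7) < 1e-9   # Ceff ≈ 7 ns
# With h = 1, the whole memory appears as fast as the cache:
assert c_eff(1.0, 2, 100) == 2
```

Note how strongly the miss rate dominates: halving 1 – h from 0.05 to 0.025 cuts the main-memory contribution in half.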
Compulsory misses: also called "cold-start misses"; they occur on the first access to any cache line. With on-demand fetching, the first access to any item is a miss. Some "compulsory" misses can be avoided by prefetching.
Capacity misses: since cache capacity is limited, we have to oust (throw out) some items to make room for others. This leads to misses that would not be incurred with an infinitely large cache.
Conflict misses: also called "collision misses". Occasionally there is free room, or space occupied by useless data, but the mapping/placement scheme forces us to displace useful items to bring in other items. This may lead to misses in the future.
DESIGN PARAMETERS:
Cache size: in bytes or words; a larger cache can hold more of the program's useful data but is more costly and likely to be slower.
Block size or cache line width: the unit of data transfer between cache and main memory. With a larger cache line, more data is brought into the cache with each miss. This can improve the hit rate but may also bring in low-utility data.
Placement policy: determines where an incoming cache line can be stored (where to place data coming from main memory). More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).
Replacement policy: determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten.
Replacement in 2 ways:
1. choosing a random block
2. choosing the least recently used block.
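Both policies can be sketched for a single cache set (an illustrative model, not the book's implementation):

```python
# Minimal sketch of the two replacement policies, for one cache set
# holding `ways` blocks.

import random
from collections import OrderedDict

def random_victim(blocks):
    """Random replacement: evict any resident block."""
    return random.choice(list(blocks))

class LRUSet:
    """Least-recently-used replacement for one cache set."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()        # oldest entry first

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU block if full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # now most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# A miss, B miss, A hit; C evicts B (the LRU block); B then misses again.
assert hits == [False, False, True, False, False]
```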
Write policy: determines whether updates to cache words are immediately forwarded to main memory (write-through) or modified blocks are copied back to main memory if and when they must be replaced (write-back or copy-back).
o Write-through: main memory is updated whenever the cache word is updated (on every memory write).
o Write-back (copy-back): a modified cache block is copied back to main memory only when it is replaced.
REPLACEMENT ALGORITHMS: (Refer page no. 229-230 in XEROX )
3. WHAT MAKES THE CACHE WORK?
How can this work? The answer is: locality.
During execution of a program, memory references by the processor, for both instructions and data, tend to cluster: once an area of the program is entered, there are repeated references to a small set of instructions (loop, subroutine) and data (components of a data structure, local variables or parameters on the stack).
Cache improves the performance of modern processors because of 2 locality properties of memory access patterns in typical programs.
These locality properties cause the instructions and data in use at a given point in a program's execution to reside in the cache, which results in high cache hit rates (90-98%) and low cache miss rates (2-10%).
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon; an instruction or data item, once accessed, is likely to be accessed again shortly.
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon; nearby memory locations are frequently accessed consecutively.
4. CACHE ORGANIZATION (MAPPING)
(Refer page number 221 - 227 in xerox)
Direct Mapping
Advantages:
• Simple and cheap.
• The tag field is short; only those bits have to be stored which are not used to address the cache (compare with the following approaches).
• Access is very fast.
Disadvantage:
• A given block fits into a fixed cache location, so a given cache line will be replaced whenever there is a reference to another memory block which maps to the same line, regardless of the status of the other cache lines. This can produce a low hit ratio, even if only a very small part of the cache is effectively used.
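A direct-mapped lookup splits the address as sketched below (the bit widths are illustrative assumptions); the example shows how two blocks that map to the same line evict each other:

```python
# Sketch of how a direct-mapped cache splits a memory address into
# tag, line index, and byte offset.

OFFSET_BITS = 4    # 16-byte cache lines (assumed)
INDEX_BITS = 8     # 256 lines (assumed)

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # only the tag is stored
    return tag, index, offset

# Two addresses 4096 bytes apart map to the same line with different tags,
# so they evict each other even if the rest of the cache is empty:
t1, i1, _ = split_address(0x01230)
t2, i2, _ = split_address(0x01230 + (1 << (OFFSET_BITS + INDEX_BITS)))
assert i1 == i2 and t1 != t2
```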
5. CACHE PERFORMANCE MEASURE
For a given cache size, the following design issues and tradeoffs exist:
Line width (2^W words). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.
Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses. More on this later.
Line replacement policy. Usually the LRU (least recently used) algorithm or some approximation thereof; not an issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice.
Write policy. Modern caches are very fast, so write-through is seldom a good choice. We usually implement write-back or copy-back, using write buffers to soften the impact of main memory latency.
Performance characteristics of two level memories: (Refer page no. 243-244 in xerox)
6. CACHE AND MAIN MEMORY
(Refer page no. 345-346 in text book B.Parhami)
Associative Mapping
Advantages:
• Associative mapping provides the highest flexibility concerning the line to be replaced when a new block is read into the cache.
Disadvantages:
• Complex.
• The tag field is long.
• Fast access can be achieved only using high-performance associative memories for the cache, which is difficult and expensive.
Split cache: separate instruction and data caches (L1)
Unified cache: holds instructions and data (L1, L2, L3)
Harvard architecture: separate instruction and data memories
Von Neumann architecture: one memory for instructions and
data
The writing problem:
Write-through slows down the cache to allow main memory to catch up.
Write-back or copy-back is less problematic, but still hurts performance due to two main memory accesses in some cases.
Solution: provide write buffers for the cache so that it does not have to wait for main memory to catch up.
Advantages of unified caches:
- they are able to better balance the load between instruction and data fetches depending on the dynamics of the program execution;
- design and implementation are cheaper.
Advantages of split caches (Harvard architectures):
- competition for the cache between instruction processing and execution units is eliminated; instruction fetch can proceed in parallel with memory access from the execution unit.
7. CACHE COHERENCY
(Refer page number 228 - 229 in xerox)
(Refer page no. 512-514 in text book B.Parhami)
V. SECONDARY STORAGE (MASS MEMORY CONCEPTS)
(Refer page number 200 – 218 in xerox)
(Refer page no. 353 – 365 in text book B.Parhami)
1. Disk Memory Basics
2. Organizing Data on Disk
3. Disk Performance
4. Disk Caching
5. Disk Arrays and RAID (Refer page number 209 in xerox)
6. Other Types of Mass Memory
VI. VIRTUAL MEMORY AND PAGING
(Refer page number 230 - 243 in xerox)
1.The Need for Virtual Memory
2.Address Translation in Virtual Memory
3.Translation Lookaside Buffer
4.Page Placement and Replacement
5.Main and Mass Memories
6.Improving Virtual Memory Performance
Page Table
• The page table has one entry for each page of the virtual memory space.
• Each entry of the page table holds the address of the memory frame which stores the respective page, if that page is in main memory.
• Each entry of the page table also includes some control bits which describe the status of the page:
- whether the page is actually loaded into main memory or not;
- whether the page has been modified since it was last loaded;
- information concerning the frequency of access, etc.
Problems:
- The page table is very large (the number of pages in the virtual memory space is very large).
- Access to the page table has to be very fast, so the page table has to be stored in very fast memory, on chip.
• A special cache, called the translation lookaside buffer (TLB), is used for page table entries; it works in the same way as an ordinary memory cache and contains those page table entries which have been most recently used.
• The page table is often too large to be stored in main memory. Virtual memory techniques are used to store the page table itself, so only part of the page table is stored in main memory at a given moment.
The page table itself is thus distributed along the memory hierarchy:
- TLB (cache)
- main memory
- disk
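The translation path (TLB hit, TLB miss with a page-table access, page fault) can be sketched behaviorally; the page size and table contents below are illustrative assumptions:

```python
# Behavioral sketch of virtual-to-physical address translation with a TLB.

PAGE_BITS = 12                      # 4 KB pages (assumed)

page_table = {0x00400: 0x12345,     # virtual page number -> physical frame
              0x00401: 0x00007}
tlb = {}                            # small cache of page-table entries

def translate(vaddr):
    vpage, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    if vpage in tlb:                       # TLB hit: no page-table access
        frame = tlb[vpage]
    elif vpage in page_table:              # TLB miss: walk the page table
        frame = tlb[vpage] = page_table[vpage]
    else:                                  # page fault: handled by the OS
        raise RuntimeError("page fault: OS loads the page from disk")
    return (frame << PAGE_BITS) | offset   # the page offset is unchanged

assert translate(0x00400ABC) == 0x12345ABC   # miss fills the TLB
assert translate(0x00400DEF) == 0x12345DEF   # subsequent access is a TLB hit
```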
Memory Reference with Virtual Memory and TLB
• Memory access is handled by hardware, except for the page fault sequence, which is executed by the OS software.
• The hardware unit responsible for translating a virtual address into a physical one is the Memory Management Unit (MMU).