
CMageshKumar_AP_AIHT CS2071_Computer Architecture

ANAND INSTITUTE OF HIGHER TECHNOLOGY Chennai-603 103 DEPARTMENT OF ELECTRONICS AND INSTRUMENTATION ENGINEERING

CS2071 COMPUTER ARCHITECTURE

Faculty: C.MAGESHKUMAR    Class: IV EIE A & B    Semester: VII

UNIT IV – MEMORY SYSTEM

CONTENTS

I. Review of digital design
1. Signals, logic operators and gates
2. Gates as control elements
3. Combinational circuits
4. Programmable combinational parts
5. Sequential circuits

II. Main memory concepts
1. Memory – definition
2. Memory hierarchy
3. Memory performance parameters
4. Memory structure and memory cycle, memory chip organization
5. Hitting the memory wall, pipelined memory and interleaved memory

III. Types of memory
1. Types
2. Static RAM
3. Dynamic RAM
4. Other types

IV. Cache memory organization
1. Cache memory & need for cache
2. Basic cache terms, design parameters of cache memory
3. What makes the cache work?
4. Cache organization (mapping)
5. Cache performance measures
6. Cache and main memory
7. Cache coherency

V. Secondary storage (mass memory concepts)

VI. Virtual memory and paging


REVIEW OF DIGITAL DESIGN

1. SIGNALS, LOGIC OPERATORS AND GATES:

Signals:

All information elements in digital computers, including instructions, numbers, and symbols, are encoded as electronic signals that are almost always two-valued (binary).

Binary signals can be represented by the presence or absence of some electrical property such as voltage, current, field, or charge.

Signals are of two types:
1. Analog signal (continuous signal)
2. Digital signal (binary signal)

Circuits:

Combinational digital circuits (memoryless circuits), e.g., multiplexers, decoders, encoders

Sequential digital circuits (circuits with memory), e.g., latches, flip-flops, registers

Logic operators:

Variations in Gate Symbols

Gates with more than two inputs and/or with inverted signals at input or output.

2. GATES AS CONTROL ELEMENTS:

Tristate buffer:

Its output is equal to the data input when the enable signal "e" is asserted, and assumes a high-impedance (disconnected) state when "e" is de-asserted.

Used to effectively isolate the output from the input.

An AND gate and a tristate buffer act as controlled switches or valves. An inverting buffer is logically the same as a NOT gate.
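This controlled-switch behaviour can be sketched in Python (an illustrative model, not from the notes; `None` stands in for the high-impedance state, and all function names are invented):

```python
def and_switch(e, x):
    """AND gate as a valve: passes bit x when e = 1, else forces 0."""
    return x & e

def tristate(e, x):
    """Tristate buffer: passes x when enabled, else high impedance."""
    return x if e else None  # None models the disconnected (Z) state

def wired_bus(*drivers):
    """Wired connection of tristate outputs: at most one may drive."""
    active = [v for v in drivers if v is not None]
    assert len(active) <= 1, "bus contention"
    return active[0] if active else None
```

For example, `wired_bus(tristate(1, 1), tristate(0, 0))` yields 1: only the enabled buffer drives the shared line, which is why tristate outputs can be tied together.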

The two binary values can be interpreted in several equivalent ways:

0: off, false, low, negative
1: on, true, high, positive

The basic logic operators (recovered from the flattened table of gate names, symbols, and arithmetic expressions):

NOT: output is 1 iff the input is 0 (arithmetic: 1 − x)
AND: output is 1 iff both inputs are 1 (arithmetic: xy)
OR: output is 1 iff at least one input is 1 (arithmetic: x + y − xy)
XOR: output is 1 iff the inputs are not equal (arithmetic: x + y − 2xy)
Derived gates: NAND, NOR, XNOR (inverted AND, OR, XOR)


Wired OR and Bus Connections:

Wired OR allows tying together of several controlled signals.

Control/Data Signals and Signal Bundles:

Arrays of logic gates represented

by a single gate symbol.

Designing Gate Networks

AND-OR, NAND-NAND, OR-AND, NOR-NOR

Logic optimization: cost, speed, power dissipation

A two-level AND-OR circuit and two equivalent circuits are shown below

BCD-to-Seven-Segment Decoder:

The logic circuit that generates the enable signal for the lowermost segment (number 3) in a seven-

segment display unit.

[Figure: gates as control elements — (a) AND gate for controlled transfer: data in x, enable/pass signal e, data out x or 0; (b) tristate buffer: data out x or high impedance; (c) model for AND switch; (d) model for tristate buffer.]

[Figure: (a) wired OR of product terms — data out x, y, z, or 0; (b) wired OR of tristate outputs — data out x, y, z, or high impedance.]

[Figure: signal bundles — (a) 8 NOR gates complementing an 8-bit bundle; (b) 32 AND gates enabling a 32-bit bundle; (c) k XOR gates.]

[Figure: (a) AND-OR circuit; (b) intermediate circuit; (c) NAND-NAND equivalent — all realizing the same function of x, y, z.]

[Figure: BCD-to-seven-segment decoder — 4-bit input x3 x2 x1 x0 in [0, 9]; signals e0–e6 enable or turn on the segments.]


3. COMBINATIONAL (MEMORYLESS) CIRCUITS : (COMBINATIONAL PARTS)

High-level building blocks

Much like prefab parts used in building a house

Arithmetic components (adders, multipliers, ALUs)

Examples of combinational parts:

multiplexers,

decoders/demultiplexers,

encoders

Multiplexer (MUX): (many to one)

Multiplexer (mux), or selector, allows one of several inputs to be selected and routed to output

depending on the binary value of a set of selection or address signals provided to it.

Decoders/Demultiplexers

A decoder allows the selection of one of 2a options using an a-bit address as input. A demultiplexer

(demux) is a decoder that only selects an output if its enable signal is asserted.
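A minimal behavioural sketch of these parts in Python (illustrative only; the function names are made up):

```python
def decoder(address, a):
    """a-bit decoder: a one-hot list asserting one of 2**a outputs."""
    return [1 if i == address else 0 for i in range(2 ** a)]

def demux(data, address, a, enable=1):
    """Demultiplexer: routes data to the selected output when enabled."""
    return [data if (enable and i == address) else 0 for i in range(2 ** a)]

def mux(inputs, address):
    """Multiplexer (selector): routes one of several inputs to the output."""
    return inputs[address]
```

Note the symmetry: a mux picks one of many inputs, a demux drives one of many outputs, and a decoder is a demux with the data input held at 1.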

4. PROGRAMMABLE COMBINATIONAL PARTS

A programmable combinational part can do the job of many gates or gate networks.

To avoid having to use large number of small-scale integrated circuits for implementing Boolean

function of several variables.

Programmed by cutting existing connections (fuses) or establishing new connections (antifuses)

Programmable ROM (PROM)

Programmable connections and their use in a PROM are shown below

Programmable array logic (PAL): when OR array has fixed connections but the inputs to

AND gates can be programmed.

Programmable logic array (PLA): both the AND and OR arrays are programmable.

Programmable combinational logic: the general structure and the two classes known as PAL and PLA devices are shown below. (A PROM corresponds to a fixed AND array — a decoder — with a programmable OR array.)

[Figure: multiplexers — (a) 2-to-1 mux; (b) switch view; (c) mux symbol; (d) mux array; (e) 4-to-1 mux with enable; (f) 4-to-1 mux design.]

[Figure: (a) 2-to-4 decoder; (b) decoder symbol; (c) demultiplexer, or decoder with enable.]

[Figure: (a) programmable OR gates; (b) logic equivalent of part a, inputs w, x, y, z; (c) programmable read-only memory (PROM) with address decoder.]
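The PAL/PLA idea — programmable product terms feeding OR gates — can be modelled in Python (a sketch; the plane encodings and the example function are made up for illustration):

```python
def pla(inputs, and_plane, or_plane):
    """Evaluate a PLA.

    and_plane: list of product terms, each a dict mapping an input
               index to the literal value it requires (absent = don't care).
    or_plane:  for each output, the list of product-term indices that
               feed its OR gate.
    """
    terms = [all(inputs[i] == v for i, v in term.items())
             for term in and_plane]
    return [any(terms[t] for t in out) for out in or_plane]

# Example (invented): f = x·y + (NOT x)·z over inputs (x, y, z)
and_plane = [{0: 1, 1: 1},   # term 0: x AND y
             {0: 0, 2: 1}]   # term 1: (NOT x) AND z
or_plane = [[0, 1]]          # one output ORing both terms
```

Programming the device amounts to choosing the dictionaries (AND plane) and index lists (OR plane), just as blowing fuses/antifuses chooses the connections in hardware.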


Timing and Circuit Considerations:

Changes in gate/circuit output, triggered by changes in its inputs, are not instantaneous

Gate delay (δ): a fraction of a nanosecond; the time taken by a gate to produce its output after its inputs change

Wire delay, previously negligible, is now important (electronic signals travel about 15 cm per ns)

Circuit simulation to verify function and timing

CMOS Transmission Gates:

A CMOS transmission gate

and its use in building a 2-to-1 mux.

5. SEQUENTIAL CIRCUITS (WITH MEMORY) (NOTE: Please refer to pages 28–34, Chapter 2 in B. Parhami, "Computer Architecture", for a detailed description)

A programmable sequential part contain gates and memory elements

Programmed by cutting existing connections (fuses) or establishing new connections

(antifuses)

Design of sequential circuits exhibiting memory requires the use of storage elements capable of holding information (a single bit) that can be set to '1' or reset to '0'

Programmable array logic (PAL)

Field-programmable gate array (FPGA)

Both types contain macrocells and interconnects

Latches, Flip-Flops, and Registers

[Figure: programmable sequential parts — (a) general programmable combinational logic with an AND array (AND plane) feeding an OR array (OR plane); (b) PAL: programmable AND array, fixed OR array (8-input ANDs); (c) PLA: programmable AND and OR arrays (6-input ANDs, 4-input ORs).]

[Figure: (a) CMOS transmission gate: circuit and symbol; (b) two-input mux built of two transmission gates.]

[Figure: (a) SR latch; (b) D latch; (c) master-slave D flip-flop; (d) D flip-flop symbol; (e) k-bit register.]
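The latch and master-slave flip-flop behaviour can be sketched in Python (an illustrative model; the class names are invented, and one `step()` call models one clock phase):

```python
class DLatch:
    """Level-sensitive D latch: transparent while C = 1, holds while C = 0."""
    def __init__(self):
        self.q = 0

    def step(self, d, c):
        if c:
            self.q = d
        return self.q

class DFlipFlop:
    """Master-slave D flip-flop from two latches: the master is
    transparent while the clock is high, the slave while it is low,
    so the output only changes on the falling clock edge."""
    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()
        self.q = 0

    def step(self, d, c):
        self.master.step(d, c)                       # master follows D while C = 1
        self.q = self.slave.step(self.master.q,      # slave copies master while C = 0
                                 0 if c else 1)
        return self.q
```

A k-bit register is then just k such flip-flops sharing one clock.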


Sequential Machine Implementation (Hardware realization of Moore and Mealy sequential machines)

DESIGNING SEQUENTIAL CIRCUITS:

Useful Sequential Parts:

High-level building blocks

Much like prefab closets used in building a house

Other memory components will be

SRAM details, DRAM, Flash

Here we cover three useful parts:

shift register, register file (SRAM basics), counter

(NOTE: Please refer to Chapters 1 and 2 in B. Parhami, "Computer Architecture", for detailed descriptions.)

[Figure: sequential machine implementation — inputs feed next-state logic, whose next-state excitation signals drive the state register; the present state feeds back to the next-state logic and forward to the output logic; the inputs reach the output logic only for a Mealy machine.]

[Figure: shift register built from D flip-flops FF0–FF2 with a shared clock and enable.]

[Figure: (a) register file with random access — 2^h k-bit registers, write enable, h-bit write address, k-bit write data, a decoder selecting the written register, read enable, two h-bit read addresses and two k-bit read-data outputs through muxes; (b) graphic symbol for the register file; (c) FIFO symbol — k-bit input/output, push/pop controls, full/empty flags.]


II. MAIN MEMORY CONCEPTS:

1. MEMORY – DEFINITION

Memory:

Memory refers to a physical device used to store programs or data on a temporary or permanent basis for

use in a computer or other electronic device.

Memory cell:

A memory cell is capable of storing one bit of information. It is usually organized in the form of an array.

Components of the Memory System

• Main memory: fast, random access, expensive, located close to (but not inside) the CPU; used to store programs and data that are currently manipulated by the CPU.

• Secondary memory: slow, cheap, direct access, located remotely from the CPU.

2. MEMORY HIERARCHY

The Need for a Memory Hierarchy:

To match memory speed with processor speed

o Memory holding the program must be accessible in nanoseconds or less.

The widening speed gap between CPU and main memory

o Processor operations take of the order of 1 ns

o Memory access requires 10s or even 100s of ns

Memory bandwidth limits the instruction execution rate

o Each instruction executed involves at least one memory access. Hence, a few to 100s of MIPS is

the best that can be achieved.

o A fast buffer memory can help bridge the CPU-memory gap

o The fastest memories are expensive and thus not very large
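The bandwidth bound in the bullets above can be illustrated with a small calculation (all figures assumed for illustration):

```python
def max_mips(memory_cycle_ns, accesses_per_instruction=1):
    """Upper bound on the instruction rate imposed by memory bandwidth:
    each instruction needs at least one memory access, so the memory
    cycle time caps how many instructions can complete per second."""
    instructions_per_second = 1e9 / (memory_cycle_ns * accesses_per_instruction)
    return instructions_per_second / 1e6  # in MIPS
```

With a 100 ns memory cycle and one access per instruction, at most 10 MIPS can be achieved — consistent with the "few to 100s of MIPS" figure above, and the reason a fast buffer (cache) is needed.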

Problems with the Memory System

What do we need?

We need memory that fits very large programs and works at a speed comparable to that of the microprocessors.

Main problem:

- microprocessors work at a very high rate and they need large memories;

- memories are much slower than microprocessors.

Facts:

- the larger a memory, the slower it is;

- the faster the memory, the greater the cost per bit.

A Solution:

It is possible to build a composite memory system which combines a small, fast memory and a large slow

main memory and which behaves (most of the time) like a large fast memory.

The two level principle above can be extended into a hierarchy of many levels including the secondary

memory (disk store).

The effectiveness of such a memory hierarchy is based on a property of programs called the principle of locality.


Some typical characteristics:

1. Processor registers:

- 32 registers of 32 bits each = 128 bytes

- access time = few nanoseconds

2. On-chip cache memory:

- capacity = 8 to 32 Kbytes

- access time = ~10 nanoseconds

3. Off-chip cache memory:

- capacity = few hundred Kbytes

- access time = tens of nanoseconds

4. Main memory:

- capacity = tens of Mbytes

- access time = ~100 nanoseconds

5. Hard disk:

- capacity = few Gbytes

- access time = tens of milliseconds

The key to the success of a memory hierarchy is whether data and instructions can be distributed across the memory so that most of the time they are available, when needed, on the top levels of the hierarchy.

• The data held in the registers is under the direct control of the compiler or of the assembly programmer.

• The contents of the other levels of the hierarchy are managed automatically:

- migration of data/instructions to and from caches is performed under hardware control;

- migration between main memory and backing store is controlled by the operating system (with hardware support).


3. MEMORY PERFORMANCE PARAMETERS(Refer page no. 167 & 168 in Xerox)

Access methods: sequential access and random access

Performance: Access time, memory cycle time, transfer rate

4. MEMORY STRUCTURE AND MEMORY CYCLE, MEMORY CHIP ORGANIZATION

(With addition to the below notes & pictures, also refer page number 175 to 191 in Xerox)

SRAM:

Basically large array of storage cells that are accessed like registers

SRAM memory cell requires 4-6 transistors / bit

SRAM holds the stored data as long as it is powered on.

These storage cells are edge-triggered D flip-flops.

Limitations of flip flops:

o Adds complexity to cells

o Only fewer cells can be mounted on chip.

So latches are used instead of flip-flops, although reads and writes then take more time.

Memory Structure and SRAM (page no. 317 in B. Parhami)

Conceptual inner structure of a 2^h × g SRAM chip and its shorthand representation is shown below.

SRAM with Bidirectional Data Bus

When data input and output of an SRAM chip are shared or connected to a bidirectional data bus, output

must be disabled during write operations.

[Figure: SRAM chip — an h-bit address feeds an address decoder selecting one of 2^h rows of storage cells (D flip-flops); g-bit data in and data out; control signals write enable (WE), chip select (CS), and output enable (OE); shorthand symbol alongside. Second figure: SRAM with a shared, bidirectional g-bit data in/out bus.]
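A toy Python model of such a chip's control signals (a sketch with invented names, not a real device model; `None` stands for a high-impedance output):

```python
class SRAMChip:
    """Toy 2**h x g SRAM with chip select (cs), write enable (we),
    and output enable (oe). The output drives the bus only when the
    chip is selected, output is enabled, and no write is in progress."""
    def __init__(self, h, g):
        self.words = [0] * (2 ** h)
        self.mask = (1 << g) - 1   # keep stored words g bits wide

    def access(self, addr, data_in=0, cs=0, we=0, oe=0):
        if cs and we:
            self.words[addr] = data_in & self.mask
        if cs and oe and not we:   # output disabled during writes
            return self.words[addr]
        return None                # high impedance
```

The `not we` condition captures the point above: with a shared bidirectional data bus, the output drivers must be turned off during a write so the chip does not fight the incoming data.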


Multiple-Chip SRAM

Eight 128K × 8 SRAM chips forming a 256K × 32 memory unit are shown below.

DRAM and Refresh Cycles

DRAM:

Stores data as electric charge on tiny capacitor, that is accessed by MOS transistor

When the word line is asserted:

o to write:

a low voltage on the bit line causes the capacitor to discharge, i.e., bit = '0'

a high voltage on the bit line causes the capacitor to charge, i.e., bit = '1'

o to read:

the read operation takes 2 steps:

step 1: the row is accessed

step 2: column selection

the bit line is precharged first to a halfway voltage and then sensed by a sense amplifier.

The read operation destroys the content, so a write operation is performed after reading. This is also called destructive readout.

Single-transistor DRAM cell, which is considerably simpler than SRAM cell, leads to dense, high-capacity

DRAM memory chips.

[Figure: eight 128K × 8 SRAM chips, each with WE, CS, OE, Din, Dout, and Addr; a 17-bit address plus the MSB for chip-select decoding; 32-bit data in, and data out split into bytes 0–3.]

[Figure: (a) DRAM cell — word line, pass transistor, capacitor, bit line; (b) typical SRAM cell — word line, bit line and complemented bit line, Vcc.]


DRAM Refresh Cycles and Refresh Rate:

o Variations in the voltage across a DRAM cell capacitor after writing a 1 and subsequent refresh

operations

o Leakage of charge from the tiny capacitor causes the data to be lost after a fraction of a second. So DRAM must be periodically refreshed.

o Refreshing: a write (restore) operation is performed before the capacitor charge decays to the threshold voltage.
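The cost of refreshing can be estimated with a small calculation (all figures below are assumed for illustration, not taken from the notes):

```python
def refresh_overhead(rows, refresh_period_ms, row_refresh_ns):
    """Fraction of time a DRAM spends refreshing: every row must be
    refreshed once per refresh period, and each row refresh occupies
    the memory for row_refresh_ns."""
    busy_ns = rows * row_refresh_ns
    period_ns = refresh_period_ms * 1e6
    return busy_ns / period_ns

# e.g. 8192 rows refreshed every 64 ms, at 100 ns per row refresh
```

With those assumed figures the memory is busy refreshing about 1.3% of the time — small, but it is time stolen from normal accesses.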

DRAM Packaging:

24-pin dual in-line package (DIP): typical DRAM package housing a 16M × 4 memory

MEMORY CYCLE:(Refer page number 182, 183 in xerox)

5. HITTING MEMORY WALL, PIPELINED MEMORY AND INTERLEAVED MEMORY

Hitting the Memory Wall:

Memory density and capacity have grown

along with the CPU power and complexity,

but memory speed has not kept pace.

Bridging the CPU-Memory Speed Gap

Two ways of using a wide-access memory to bridge the speed gap between the processor and memory.

Idea: Retrieve more data from memory with each access

[Figure: voltage across a DRAM cell capacitor over time — a written 1 decays toward the threshold voltage and is restored by periodic refreshes; tens of ms elapse before a refresh cycle is needed.]

[Figure: 24-pin DIP pinout — Ai: address bit i (A0–A10); Dj: data bit j (D1–D4); CAS: column address strobe; RAS: row address strobe; WE: write enable; OE: output enable; Vcc, Vss: supply; NC: no connection.]

[Figure: relative performance of processor vs. memory, 1980–2010 — processor performance grows much faster than memory performance, producing the memory wall.]

[Figure: two ways of using a wide-access memory — (a) buffer and multiplexer at the memory side, with a narrow bus to the processor; (b) buffer and multiplexer at the processor side, with a wide bus to the processor.]


PIPELINED MEMORY AND INTERLEAVED MEMORY:

(Refer page no. 325 in text book B.Parhami)

Memory latency may involve other supporting operations besides the physical access itself

o Virtual-to-physical address translation

o Tag comparison to determine cache hit/miss

Pipelined cache memory is shown below

Memory Interleaving:

o Interleaved memory is more flexible than wide-access memory in that it can handle multiple

independent accesses at once.
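The interleaving scheme in the figure — dispatching on the low-order address bits — can be sketched in Python (illustrative only):

```python
def module_of(address, banks=4):
    """Low-order interleaving: consecutive addresses map to
    consecutive banks (address mod banks), so a sequential access
    stream is spread round-robin and the banks can overlap their
    long memory cycles."""
    return address % banks
```

For example, addresses 0–7 map to banks [0, 1, 2, 3, 0, 1, 2, 3]: by the time the stream returns to bank 0, that bank has had three bus cycles to finish its previous access.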

III. TYPES OF MEMORY (Refer page no. 171 to 175, 193 to 196, 200 to 209 in Xerox)

1. TYPES

2. Static RAM

3. Dynamic RAM

4. Other types

IV. CACHE MEMORY ORGANIZATION

1. CACHE MEMORY & NEED FOR CACHE:

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access

memory. The cache is a smaller, faster memory which stores copies of the data from frequently used main

memory locations.

As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

A cache memory is a small, very fast memory that retains copies of recently used information from main memory. It operates transparently to the programmer, automatically deciding which values to keep and which to overwrite.

The processor operates at its high clock rate only when the memory items it requires are held in the cache. The overall system performance depends strongly on the proportion of memory accesses that can be satisfied by the cache.

[Figure: pipelined cache memory — stages for address translation, row decoding and readout, column decoding and selection, and tag comparison and validation, with data in/out and return data paths.]

[Figure: four-way interleaved memory — accesses are dispatched (based on the 2 LSBs of the address) to modules handling addresses that are 0, 1, 2, and 3 mod 4; short bus cycles overlap with the longer memory cycles of the individual modules.]


Cache space (~KBytes) is much smaller than main memory (~MBytes). Items have to be placed in the cache so that they are available there when (and possibly only when) they are needed.

As memory size increases, cost per bit decreases, but speed decreases and memory access time increases.

Processor speed ≠ memory speed

There is a huge gap between processor speed and memory speed: memory performance has not kept pace with the improvement in processor performance.

Cache memories act as intermediaries between the superfast processor and the much slower main

memory.

Multiple caches

2. BASIC CACHE TERMS, DESIGN PARAMETERS OF CACHE MEMORY

An access to an item which is in the cache (finding the required data in the cache): cache hit

An access to an item which is not in the cache (not finding the required data in the cache): cache miss

The proportion of all memory accesses that are satisfied by the cache, i.e., the fraction of data accesses that can be satisfied from the cache as opposed to the slower main memory: hit rate

The proportion of all memory accesses that are not satisfied by the cache: miss rate

The miss rate of a well-designed cache: a few %

Cfast – cache memory access cycle

Cslow – slower memory (main memory) access cycle

Ceff – effective memory cycle time

With one level of cache with hit rate h:

Ceff = h·Cfast + (1 – h)(Cslow + Cfast)

Ceff = Cfast + (1 – h)·Cslow

Ceff = Cfast (when the hit rate h = 1)

Ceff = Cfast creates the illusion that the entire memory space consists of fast (cache) memory.
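The Ceff formula can be checked with a short Python sketch (the cycle times below are assumed for illustration):

```python
def c_eff(h, c_fast, c_slow):
    """Effective memory cycle time with one cache level of hit rate h:
    Ceff = Cfast + (1 - h) * Cslow."""
    return c_fast + (1 - h) * c_slow
```

With a 1 ns cache, a 100 ns main memory, and a 95% hit rate, the effective cycle time is about 6 ns — close to the cache speed, which is the illusion the formula describes.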


Compulsory misses: also called "cold-start misses". Occur on the first access to any cache line: with on-demand fetching, the first access to any item is a miss. Some "compulsory" misses can be avoided by prefetching.

Capacity misses: since cache capacity is limited, we have to oust (throw out) some items to make room for others. This leads to misses that would not be incurred with an infinitely large cache.

Conflict misses: also called "collision misses". Occasionally there is free room, or space occupied by useless data, but the mapping/placement scheme forces us to displace useful items to bring in other items. This may lead to misses in the future.

DESIGN PARAMETERS:

Cache size: in bytes or words. A larger cache can hold more of the program's useful data but is more costly and likely to be slower.

Block size or cache line width: the "unit of data transfer between cache and main memory". With a larger cache line, more data is brought into the cache with each miss. This can improve the hit rate but may also bring in low-utility data.

Placement policy: determines where an incoming cache line can be stored (i.e., where to place data coming from main memory). More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).

Replacement policy: determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies:

1. choosing a random block

2. choosing the least recently used block

Write policy: determines whether updates to cache words are immediately forwarded to main memory (write-through), or modified blocks are copied back to main memory only if and when they must be replaced (write-back or copy-back).

o write-through: main memory is updated on every write to the cache word

o write-back (copy-back): a modified cache block is copied back to main memory in its entirety, only when it is replaced

REPLACEMENT ALGORITHMS: (Refer page no. 229-230 in XEROX )

3. WHAT MAKES THE CACHE WORK?

How can this work? The answer is: locality.

During execution of a program, memory references by the processor, for both instructions and data, tend to cluster: once an area of the program is entered, there are repeated references to a small set of instructions (loop, subroutine) and data (components of a data structure, local variables or parameters on the stack).

The cache improves the performance of modern processors because of 2 locality properties of memory access patterns in typical programs.

These locality properties cause the instructions and data needed at a given point in a program's execution to reside in the cache, resulting in high cache hit rates (90-98%) and low cache miss rates (2-10%).


Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. An instruction or data item, once accessed, is likely to be accessed again shortly afterwards.

Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon, i.e., nearby memory locations are frequently accessed consecutively.

4. CACHE ORGANIZATION (MAPPING)

(Refer page number 221 - 227 in xerox)

Direct Mapping

Advantages:
• Simple and cheap.
• The tag field is short; only those bits have to be stored which are not used to address the cache (compare with the following approaches).
• Access is very fast.

Disadvantage:
• A given block fits into a fixed cache location, so a given cache line will be replaced whenever there is a reference to another memory block which maps to the same line, regardless of the status of the other cache lines. This can produce a low hit ratio, even if only a very small part of the cache is effectively used.
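The fixed placement of direct mapping follows from slicing the address into tag, line, and word fields; a Python sketch (the field widths in the example are assumed for illustration):

```python
def split_address(addr, word_bits, line_bits):
    """Split a memory address into direct-mapped cache fields:
    the low word_bits select the word within the line, the next
    line_bits fix the (only possible) cache line, and the rest is
    the tag stored for comparison on each access."""
    word = addr & ((1 << word_bits) - 1)
    line = (addr >> word_bits) & ((1 << line_bits) - 1)
    tag = addr >> (word_bits + line_bits)
    return tag, line, word
```

Two blocks whose addresses differ only in the tag field land on the same line, which is exactly the conflict situation described above.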


5. CACHE PERFORMANCE MEASURE

For a given cache size, the following design issues and tradeoffs exist:

Line width (2^W). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.

Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses. More on this later.

Line replacement policy. Usually LRU (least recently used) algorithm or some approximation thereof; not an

issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice.

Write policy. Modern caches are very fast, so that write-through is seldom a good choice. We usually implement

write-back or copy-back, using write buffers to soften the impact of main memory latency.

Performance characteristics of two level memories: (Refer page no. 243-244 in xerox)

6. CACHE AND MAIN MEMORY

(Refer page no. 345-346 in text book B.Parhami)

Associative Mapping

Advantages:
• Associative mapping provides the highest flexibility concerning the line to be replaced when a new block is read into the cache.

Disadvantages:
• Complex.
• The tag field is long.
• Fast access can be achieved only using high-performance associative memories for the cache, which is difficult and expensive.
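That replacement flexibility can be sketched with a toy fully associative cache using LRU replacement (illustrative Python; the class name and interface are invented):

```python
from collections import OrderedDict

class AssociativeCache:
    """Fully associative cache with LRU replacement: any block may go
    in any line, so on a miss we are free to evict the least recently
    used line rather than a fixed one."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # tag -> data, ordered by recency
        self.hits = self.misses = 0

    def access(self, tag):
        """Return True on a hit, False on a miss (fetching the line)."""
        if tag in self.lines:
            self.hits += 1
            self.lines.move_to_end(tag)      # mark most recently used
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[tag] = None
        return False
```

In hardware this freedom is what costs the long tags and parallel comparators: every stored tag must be compared against the address at once.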

Split cache: separate instruction and data caches (L1)

Unified cache: holds instructions and data (L1, L2, L3)

Harvard architecture: separate instruction and data memories

Von Neumann architecture: one memory for instructions and

data

The writing problem:

Write-through slows down the cache to allow main to catch up.

Write-back or copy-back is less problematic, but still hurts

performance due to two main memory accesses in some cases.

Solution: Provide write buffers for the cache so that it does not

have to wait for main memory to catch up.

Advantages of unified caches:

- they are able to better balance the load between instruction and

data fetches depending on the dynamics of the program execution;

- design and implementation are cheaper.

Advantages of split caches (Harvard architecture):

- competition for the cache between instruction processing and execution units is eliminated: instruction fetch can proceed in parallel with memory access from the execution unit.


7. CACHE COHERENCY

(Refer page number 228 - 229 in xerox)

(Refer page no. 512-514 in text book B.Parhami)

V. SECONDARY STORAGE (MASS MEMORY CONCEPTS)

(Refer page number 200 – 218 in xerox)

(Refer page no. 353 – 365 in text book B.Parhami)

1. Disk Memory Basics

2. Organizing Data on Disk

3. Disk Performance

4. Disk Caching

5. Disk Arrays and RAID (Refer page number 209 in xerox)

6. Other Types of Mass Memory

VI. VIRTUAL MEMORY AND PAGING

(Refer page number 230 - 243 in xerox)

1. The Need for Virtual Memory

2. Address Translation in Virtual Memory

3. Translation Lookaside Buffer

4. Page Placement and Replacement

5. Main and Mass Memories

6. Improving Virtual Memory Performance


Page Table

• The page table has one entry for each page of the virtual memory space.

• Each entry of the page table holds the address of the memory frame which stores the respective page, if that page is in main memory.

• Each entry of the page table also includes some control bits which describe the status of the page:

- whether the page is actually loaded into main memory or not;

- whether the page has been modified since it was last loaded;

- information concerning the frequency of access, etc.

Problems:

- The page table is very large (the number of pages in the virtual memory space is very large).

- Access to the page table has to be very fast, so the page table has to be stored in very fast memory, on chip.

• A special cache is used for page table entries, called the translation lookaside buffer (TLB); it works in the same way as an ordinary memory cache and contains those page table entries which have been most recently used.

• The page table is often too large to be stored in main memory. Virtual memory techniques are used to store the page table itself: only part of the page table is stored in main memory at a given moment.

The page table itself is distributed along the memory hierarchy:

- TLB (cache)

- main memory

- disk

Memory Reference with Virtual Memory and TLB

• Memory access is handled by hardware, except the page-fault sequence, which is executed by the OS software.

• The hardware unit which is responsible for translation of a virtual address into a physical one is the Memory Management Unit (MMU).
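The TLB-then-page-table lookup described above can be sketched in Python (illustrative only; the page size and data structures are assumed, and a real MMU does this in hardware):

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

def translate(vaddr, tlb, page_table):
    """Translate a virtual address: try the TLB first, then walk the
    page table; a missing entry models a page fault, which the OS
    would service by loading the page from disk."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number + offset
    if vpn in tlb:
        frame = tlb[vpn]                     # TLB hit
    else:
        frame = page_table.get(vpn)          # TLB miss: consult page table
        if frame is None:
            raise LookupError("page fault: OS must load the page")
        tlb[vpn] = frame                     # cache the translation
    return frame * PAGE_SIZE + offset
```

Note that the offset within the page is never translated; only the page number is mapped to a frame number, which is what keeps the page table to one entry per page.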