8/22/2019 Course-module-1 _Compatibility Mode_.pdf
1
Santanu Chaudhury
Computer Architecture EEL308
2
Books:
Computer Organization and Design, The Hardware/Software Interface
Author(s) : Patterson & Hennessy
Imprint: Morgan Kaufmann
Additional References:
(i) Computer Architecture and Organisation: J.P. Hayes
(ii) Hamacher & Zaky
3
Evaluation & Attendance Policy
Minor-1: 20; Minor-2: 20
Major: 35
Tutorial, Assignments, Quiz: 25
Attendance Policy:
One grade less if attendance less than 75%
No E-grade if attendance less than 75%
4
Introduction
5
What is a computer?
An electronic device that can be programmed for solving a problem
Components of a Computer
processor
input (mouse, keyboard, scanner, camera)
output (display, printer)
memory (disk drives, DRAM, SRAM, CD)
network
Rapidly changing technology
vacuum tube -> transistor -> IC -> VLSI
doubling every 1.5 years: memory capacity, processor speed
6
Computer Architecture and Organization
Computer Architecture refers to those attributes of a system visible to a programmer
instruction set
number of bits used to represent various data types
i/o mechanisms
techniques for addressing memory
Computer Organization refers to the operational units and their interconnections that realize the architectural specifications
control signals
interfaces between the computer and peripherals; memory technology
Distinction is fundamental
manufacturers can offer computer models with the same architecture but with differences in organization
7
Structure and Function
A computer is a complex system
the hierarchic nature of complex systems is essential for their design and description
behavior at each level depends only on a simplified, abstracted characterization of the system at the lower level
At each level the designer is concerned with
structure
CPU - Central Processing Unit
Main memory: Stores data
I/O: moves data between computer and its external environment
System Interconnection
function
data processing, data storage
data movement
control
8
Structure - Top Level
[Diagram: Computer = Central Processing Unit + Main Memory + Input/Output + Systems Interconnection; external: Peripherals, Communication lines]
9
Structure - The CPU
[Diagram: CPU = Arithmetic and Logic Unit + Control Unit + Registers + Internal CPU Interconnection; the CPU connects to Memory and I/O over the System Bus]
10
Structure - The Control Unit
[Diagram: Control Unit = Sequencing Logic + Registers and Decoders + Control Memory; within the CPU it drives the ALU, Registers, and Internal Bus]
11
History
First generation computers were made with vacuum valves and used punched cards as the main (non-volatile) storage medium. A general purpose computer of this era was ENIAC (Electronic Numerical Integrator and Computer), which was completed in 1946.
The next major step in the history of computing was the invention of the transistor in 1947. Transistorized computers are normally referred to as 'Second Generation' and dominated the late 1950s and early 1960s.
12
History
'Third Generation' computers used Jack St. Clair Kilby's invention - the integrated circuit or microchip;
the first integrated circuit was produced in September 1958, but computers using them didn't begin to appear until 1963.
In 1964 IBM announced System/360 with increased storage and processing capabilities.
formed the foundation of modern computer architecture
In 1971 Intel announced the 4004 - the first chip to contain all of the components of a CPU
the microprocessor was born
Fourth generation computers used very large scale integration (VLSI) as the underlying technology
8086 and all of Intel's processors for the IBM PC and compatibles
Supercomputers of the era were immensely powerful, like the Cray-1
13
CRAY X-MP
14
Performance of computers has shown drastic improvement over time
Parameter for evaluating Performance: Response Time (latency)
How long does it take for my job to run?
How long does it take to execute a job?
How long must I wait for the database query?
Parameter for evaluating Performance: Throughput
How many jobs can the machine run at once?
What is the average execution rate?
How much work is getting done?
Performance of Computers
15
Elapsed Time covers everything (disk and memory accesses, I/O, etc.)
a useful indicator, but often not good for comparison purposes
CPU time
doesn't count I/O or time spent running other programs
can be broken up into system time and user time
User CPU time
time spent executing the lines of code that are "in" our program
Execution Time
16
For some program running on machine X,
Performance_X = 1 / Execution time_X
"X is n times faster than Y" means
Performance_X / Performance_Y = n
A Definition of Performance
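This definition can be sanity-checked with a short Python sketch; the two execution times below are made-up values for illustration, not from the slides.

```python
# Performance as the reciprocal of execution time; "n times faster"
# is then a ratio of performances. The times are hypothetical.

def performance(execution_time_s):
    return 1.0 / execution_time_s

time_x = 10.0  # hypothetical execution time on machine X (seconds)
time_y = 15.0  # hypothetical execution time on machine Y (seconds)

# X is n times faster than Y:
n = performance(time_x) / performance(time_y)
print(n)
```

With these times, X is 1.5 times faster than Y.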
17
Clock Cycles
Instead of reporting execution time in seconds, we often use cycles
The internal clock in a computer co-ordinates the execution of instructions
Clock ticks indicate when to start activities (one abstraction):
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second
seconds/program = (cycles/program) x (seconds/cycle)
18
Could assume that # of cycles = # of instructions
However, different instructions take different amounts of time on different machines.
[Timeline: the 1st, 2nd, 3rd, ... instructions executing one after another, one clock cycle each]
How many cycles are required for a program?
8/22/2019 Course-module-1 _Compatibility Mode_.pdf
10/102
19
Multiplication takes more time than addition
Floating point operations take longer than integer ones
Accessing memory takes more time than accessing registers
Different numbers of cycles for different instructions
20
A given program will require
some number of instructions (machine instructions)
some number of cycles
some number of seconds
We have a vocabulary that relates these quantities:
cycle time (seconds per cycle)
clock rate (cycles per second)
CPI (cycles per instruction)
a floating point intensive application might have a higher CPI
MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
Terminology
21
Suppose we have two machines
For some program,
Machine A has a clock cycle time of 10 ns and a CPI of 2.0
Machine B has a clock cycle time of 20 ns and a CPI of 1.2
What machine is faster for this program, and by how much?
CPI Example
If the program requires the same number of
instructions, N, on both machines, then
machine A requires 2.0 x N x 10 ns = 20N ns and
machine B requires 1.2 x N x 20 ns = 24N ns, so machine A is 1.2 times faster
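The arithmetic above can be reproduced in a small Python sketch; N is the common instruction count, which cancels in the ratio.

```python
# Execution time = instruction count x CPI x cycle time.

def exec_time_ns(n_instructions, cpi, cycle_time_ns):
    return n_instructions * cpi * cycle_time_ns

N = 1_000_000                       # any common instruction count works
time_a = exec_time_ns(N, 2.0, 10)   # machine A: 10 ns cycle, CPI 2.0
time_b = exec_time_ns(N, 1.2, 20)   # machine B: 20 ns cycle, CPI 1.2
print(time_b / time_a)              # machine A is 1.2 times faster
```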
22
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? By how much? What is the CPI for each sequence?
Number of Instructions Example
First sequence: 10 cycles, CPI = 2.0
Second sequence: 9 cycles, CPI = 1.5
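The answer can be verified mechanically; a minimal Python sketch using the cycle costs from the slide:

```python
# Cycles per instruction class, as given on the slide.
CYCLES = {"A": 1, "B": 2, "C": 3}

def total_cycles(mix):  # mix maps instruction class -> count
    return sum(CYCLES[c] * n for c, n in mix.items())

def cpi(mix):
    return total_cycles(mix) / sum(mix.values())

seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

print(total_cycles(seq1), cpi(seq1))  # 10 2.0
print(total_cycles(seq2), cpi(seq2))  # 9 1.5
```

The second sequence executes more instructions but finishes in fewer cycles.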
23
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?
MIPS example
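A Python sketch of this computation (instruction counts in millions, cycle costs from the slide) shows why the two metrics disagree:

```python
CLOCK_HZ = 100e6                   # 100 MHz machine
CYCLES = {"A": 1, "B": 2, "C": 3}  # cycles per instruction class

def time_and_mips(counts_millions):
    instr = sum(counts_millions.values()) * 1e6
    cycles = sum(CYCLES[c] * n for c, n in counts_millions.items()) * 1e6
    time_s = cycles / CLOCK_HZ
    mips = instr / (time_s * 1e6)  # millions of instructions per second
    return time_s, mips

t1, mips1 = time_and_mips({"A": 5, "B": 1, "C": 1})   # first compiler
t2, mips2 = time_and_mips({"A": 10, "B": 1, "C": 1})  # second compiler
print(mips1, mips2)  # the second compiler wins on MIPS
print(t1, t2)        # but the first compiler wins on execution time
```

The second compiler rates 80 MIPS against 70, yet takes 0.15 s against 0.1 s: a higher MIPS rating does not mean a faster program.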
24
Performance is best determined by running a real application
Use programs typical of the expected workload
Or, typical of the expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
SPEC (System Performance Evaluation Cooperative)
companies have agreed on a set of real programs and inputs
valuable indicator of performance (and compiler technology)
Benchmarks
25
SPEC95
Benchmark   Description
go          Artificial intelligence; plays the game of Go
m88ksim     Motorola 88k chip simulator; runs test program
gcc         The Gnu C compiler generating SPARC code
compress    Compresses and decompresses file in memory
li          Lisp interpreter
ijpeg       Graphic compression and decompression
perl        Manipulates strings and prime numbers
vortex      A database program
tomcatv     A mesh generation program
swim        Shallow water model with 513 x 513 grid
su2cor      Quantum physics; Monte Carlo simulation
hydro2d     Astrophysics; hydrodynamic Navier-Stokes equations
mgrid       Multigrid solver in 3-D potential field
applu       Parabolic/elliptic partial differential equations
turb3d      Simulates isotropic, homogeneous turbulence in a cube
apsi        Solves problems regarding temperature, wind velocity, and distribution of pollutants
fpppp       Quantum chemistry
wave5       Plasma physics; electromagnetic particle simulation
26
SPEC95
Can a machine with a slower clock rate have better performance?
[Plot: SPECfp rating (0-10) vs. clock rate (50-250 MHz) for the Pentium and Pentium Pro]
27
Uniprocessor to Multiprocessor
Multiple processors on a single chip: the multi-core microprocessor
Impact more on throughput than on response time
To improve response time, one may need to rewrite the code to take advantage of multiple cores
28
Instruction Set Architecture
29
Instruction Set Architecture
A very important abstraction
interface between hardware and low-level software
standardizes instructions, machine language bit patterns, etc.
There can be different implementations of the same architecture
30
Instructions
Language of the Machine
More primitive than higher level languages
Very restrictive
The variety of functions a CPU may perform is reflected in its instruction set
31
Instruction Set of Computers: Complex Instruction Set
Each instruction in a CISC instruction set might perform a series of operations inside the processor.
Reduces the number of instructions required to implement a given program, and allows the programmer to learn a small but flexible set of instructions.
You can even have a single instruction computer
Since earlier memory was slow and expensive, the CISC philosophy made sense
Most common microprocessor designs --- including the Intel 80x86 and Motorola 68K series --- also follow the CISC philosophy
Later, it was discovered that, by reducing the full set to only the most frequently used instructions, the computer would get more work done in a shorter amount of time for most applications - RISC
32
Instruction Set of Computers: Reduced Instruction Set
Background
With advances in semiconductor technology, the difference in speed between main memory and processor reduced.
a sequence of simple instructions produces the same results as a sequence of complex instructions, but can be implemented with simpler (and faster) hardware - assuming that memory can keep up
RISC forms the basis for modern design
33
Instruction Set of Computers: Reduced Instruction Set
RISC characteristics
Simple instruction set.
In a RISC machine, the instruction set contains simple, basic instructions, from which more complex instructions can be composed.
Same length instructions.
Each instruction is the same length, so that it may be fetched in a single operation.
1 machine-cycle instructions.
Most instructions complete in one machine cycle.
34
RISC Architecture
We'll be working with the MIPS instruction set architecture
similar to other architectures developed since the 1980's
used by NEC, Nintendo, Silicon Graphics, Sony
35
Instructions are bits
Programs are stored in memory
to be read or written just like data
Fetch & Execute Cycle
Instructions are fetched and put into a special register in the processor
Bits in the register "control" the subsequent actions
Fetch the next instruction and continue
[Diagram: Processor connected to Memory holding data, programs, compilers, editors, etc.]
Stored Program Concept
36
Elements of Instruction
Elements of an instruction
Operation Code
Source Operand Reference
one or two
Result Operand Reference
(may be) Next Instruction Reference
37
MIPS arithmetic Instructions
All instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2
$si indicates a REGISTER: storage inside the CPU, associated with variables by the compiler
Operands can only be registers: 32 registers provided
Principle of Regularity
38
Registers vs. Memory
[Diagram: Processor (Control + Datapath) connected to Memory and to I/O (Input, Output)]
Arithmetic instruction operands must be registers; only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables?
39
Memory Organization
Memory is viewed as a large, single-dimension array, with an address.
A memory address is an index into the array
"Byte addressing" means that the index points to a byte of memory.
[Diagram: byte-addressed memory; addresses 0, 1, 2, 3, ... each holding 8 bits of data]
40
Memory Organization
Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
2^32 bytes with byte addresses from 0 to 2^32 - 1
2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
[Diagram: word-aligned memory; byte addresses 0, 4, 8, 12, ... each holding 32 bits of data]
Registers hold 32 bits of data
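The byte/word relationship above amounts to a divide-by-4 with an alignment check; a minimal sketch:

```python
WORD_BYTES = 4  # a MIPS word is 4 bytes

def word_index(byte_addr):
    # words live at byte addresses 0, 4, 8, ...; anything else is unaligned
    if byte_addr % WORD_BYTES != 0:
        raise ValueError("unaligned word address")
    return byte_addr // WORD_BYTES

print(word_index(12))  # byte address 12 holds word 3
```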
41
More Instructions
MIPS: loading words but addressing bytes; arithmetic on registers only
Instruction Meaning
add $s1, $s2, $s3 $s1 = $s2 + $s3
sub $s1, $s2, $s3 $s1 = $s2 - $s3
lw $s1, 100($s2) $s1 = Memory[$s2+100]
sw $s1, 100($s2) Memory[$s2+100] = $s1
42
Instructions
Load and store instructions
Example:
C code: A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
           add $t0, $s2, $t0
           sw $t0, 32($s3)
Store word has destination last
Remember arithmetic operands are registers, not memory!
43
Our First Example
Can we figure out the code?
swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

swap: muli $2, $5, 4
      add  $2, $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31
44
Instructions, like registers and words of data, are also 32 bits long
Example: add $t0, $s1, $s2
registers have numbers: $t0=8, $s1=17, $s2=18
Instruction Format:
op rs rt rd shamt funct
op: 6 bits, opcode field
rs: 5 bits, first register source operand
rt: 5 bits, second register source operand
rd: 5 bits, register destination operand
shamt: 5 bits, shift amount used in shift instructions
funct: 6 bits, selects the specific variant of the operation in the opcode field
This is R-type instruction format
Machine Language
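Packing the R-type fields can be sketched by shifting each field into place. The field widths are from the slide; the register numbers follow the register-convention table later in the module, and the add funct value (32) follows the example.

```python
def r_type(op, rs, rt, rd, shamt, funct):
    # fields are 6/5/5/5/5/6 bits, most significant first
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2 -> op=0, rs=$s1(17), rt=$s2(18), rd=$t0(8), shamt=0, funct=32
word = r_type(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
```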
45
Consider the load-word and store-word instructions.
What would the regularity principle have us do?
New principle: Good design demands a compromise
Introduce a new type of instruction format
I-type for data transfer instructions
the other format was R-type for register operations
Example: lw $t0, 32($s2)
35 | 18 | 8 | 32
op | rs | rt | 16 bit number
16 bit offset: +/- 2^15 bytes from the address in the base register rs
Where's the compromise?
Instructions of fixed length but of different formats, to take care of different functional requirements
Multiple formats complicate the hardware
Machine Language
46
Decision making instructions
alter the control fl ow,
i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
Go to the statement at address Label if the value in register $t0 does not equal the value in register $t1
beq $t0, $t1, Label
Example: if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label: ....
Control
47
MIPS unconditional branch instruction:
j Label
Example:
if (i!=j)          beq $s4, $s5, Lab1
    h=i+j;             add $s3, $s4, $s5
else                   j Lab2
    h=i-j;         Lab1: sub $s3, $s4, $s5
                   Lab2: ...
Format of Branch instructions
Branch - I format
The address corresponding to Label is given by a 16 bit offset
Unconditional branch - Jump instruction
J format - 26 bit address field
Control
48
Instructions:
bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
Formats:
Use a register (as lw and sw do) and add the 16 bit address to its contents
use the Instruction Address Register (PC = program counter)
most branches are local (principle of locality)
Jump instructions just use the high order bits of the PC: address boundaries of 256 MB
I: op rs rt 16 bit address
Addresses in Branches
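The PC-relative scheme described above can be sketched directly: the 16-bit field counts words and is added to the address of the instruction after the branch (PC + 4).

```python
def branch_target(pc, offset_words):
    # the word offset is relative to the instruction after the branch
    return (pc + 4) + 4 * offset_words

print(hex(branch_target(0x1000, 3)))   # forward branch over 3 instructions
print(hex(branch_target(0x1000, -2)))  # backward branch, e.g. a loop
```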
49
So far:
Instruction Meaning
add $s1,$s2,$s3 $s1 = $s2 + $s3
sub $s1,$s2,$s3 $s1 = $s2 - $s3
lw $s1,100($s2) $s1 = Memory[$s2+100]
sw $s1,100($s2) Memory[$s2+100] = $s1
bne $s4,$s5,L Next instr. is at L if $s4 != $s5
beq $s4,$s5,L Next instr. is at L if $s4 = $s5
j Label Next instr. is at Label
Formats:
R: op rs rt rd shamt funct
I: op rs rt 16 bit address
J: op 26 bit address
50
We have beq and bne; what about branch-if-less-than?
New instruction: slt $t0, $s1, $s2 means
    if $s1 < $s2 then $t0 = 1
    else $t0 = 0
Can use this instruction to build branch-if-less-than:
slt $t0,$s0,$s1
bne $t0,$zero,Less
Register $zero always contains 0
Control Flow
51
While loop
Example C code:
while (save[i] == k)
    i = i + j;
Corresponding MIPS code
Assume i, j, k correspond to registers $s3, $s4, $s5 respectively, and the base of the array save is in $s6.
Loop: add $t1,$s3,$s3  # reg $t1 = 2*i
      add $t1,$t1,$t1  # reg $t1 = 4*i
      add $t1,$t1,$s6  # $t1 = address of save[i]
      lw  $t0, 0($t1)  # $t0 = save[i]
      bne $t0,$s5,Exit # go to Exit if save[i] != k
      add $s3,$s3,$s4  # i = i + j
      j   Loop         # go to Loop
Exit:
52
Register Use Conventions
Name     Register number  Usage
$zero    0                the constant value 0
$v0-$v1  2-3              values for results and expression evaluation
$a0-$a3  4-7              arguments
$t0-$t7  8-15             temporaries
$s0-$s7  16-23            saved
$t8-$t9  24-25            more temporaries
$gp      28               global pointer
$sp      29               stack pointer
$fp      30               frame pointer
$ra      31               return address
53
Small constants are used quite frequently (50% of operands)
e.g., A = A + 5;
      B = B + 1;
      C = C - 18;
Possible mechanisms:
put 'typical constants' in memory along with instructions and load them
create hard-wired registers (like $zero) for constants like one
MIPS instructions:
addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori  $29, $29, 4
How do we make this work?
Instructions are I-type
16 bit field for the constant
Constants
54
We'd like to be able to load a 32 bit constant into a register
Must use two instructions; first, a new "load upper immediate" instruction:
lui $t0, 1010101010101010
The upper half is loaded and the lower half is filled with zeros:
1010101010101010 0000000000000000
Then we must get the lower order bits right:
ori $t0, $t0, 1010101010101010
    1010101010101010 0000000000000000
ori 0000000000000000 1010101010101010
  = 1010101010101010 1010101010101010
How about larger constants?
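The lui/ori pair above is just a shift and an OR on the two 16-bit halves; a quick sketch on Python integers:

```python
def lui(const16):
    return (const16 & 0xFFFF) << 16   # upper half set, lower half zero-filled

def ori(reg, const16):
    return reg | (const16 & 0xFFFF)   # fill in the lower 16 bits

t0 = lui(0b1010101010101010)
t0 = ori(t0, 0b1010101010101010)
print(f"{t0:032b}")  # 10101010101010101010101010101010
```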
55
Supporting Procedures in Hardware
Execution of a procedure requires:
Place parameters so that the procedure can access them
Transfer control to the procedure
Acquire storage resources for the procedure
Place the result so that the calling procedure can access it
Return control to the point of origin
In the MIPS architecture
$a0 - $a3: four argument registers in which to pass parameters
$v0 - $v1: two value registers in which to return results
$ra: one return address register to return to the point of origin
Special instruction:
jal ProcedureAddress
Control jumps to the address and simultaneously saves the address of the following instruction in $ra
56
Assembly Language vs. Machine Language
57
Other Issues
58
simple instructions, all 32 bits wide
very structured, no unnecessary baggage
only three instruction formats
rely on the compiler to achieve performance
what are the compiler's goals?
help the compiler where we can
R: op rs rt rd shamt funct
I: op rs rt 16 bit address
J: op 26 bit address
Summarising
59
MIPS addressing modes:
1. Immediate addressing: op | rs | rt | Immediate; the operand is a constant within the instruction itself
2. Register addressing: op | rs | rt | rd | ... | funct; the operand is a register
3. Base addressing: op | rs | rt | Address; the operand is in memory (byte, halfword, or word) at the address given by a register plus the 16 bit constant
4. PC-relative addressing: op | rs | rt | Address; the branch address is the PC plus the 16 bit constant
5. Pseudodirect addressing: op | Address; the jump address is the 26 bit constant concatenated with the upper bits of the PC
60
Design alternative:
provide more powerful operations
goal is to reduce the number of instructions executed
danger is a slower cycle time and/or a higher CPI
Sometimes referred to as RISC vs. CISC
virtually all new instruction sets since 1982 have been RISC
VAX: minimize code size, make assembly language easy
instructions from 1 to 54 bytes long!
We'll look at PowerPC and 80x86
Alternative Architectures
61
PowerPC
Indexed addressing
example: lw $t1,$a0+$s3  # $t1 = Memory[$a0+$s3]
What do we have to do in MIPS?
Update addressing
update a register as part of a load (for marching through arrays)
example: lwu $t0,4($s3)  # $t0 = Memory[$s3+4]; $s3 = $s3+4
What do we have to do in MIPS?
Others:
load multiple / store multiple
a special counter register: bc Loop decrements the counter, and if it is not 0, goes to Loop
62
80x86
1978: The Intel 8086 is announced (16 bit architecture)
1980: The 8087 floating point coprocessor is added
1982: The 80286 increases the address space to 24 bits, adds instructions
1985: The 80386 extends to 32 bits, new addressing modes
1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
1997: MMX is added
This history illustrates the impact of the "golden handcuffs" of compatibility
adding new features as someone might add clothing to a packed bag
an architecture that is difficult to explain and impossible to love
63
A dominant architecture: 80x86
See your textbook for a more detailed description
Complexity:
Instructions from 1 to 17 bytes long
one operand must act as both a source and destination
one operand can come from memory
complex addressing modes, e.g., base or scaled index with 8 or 32 bit displacement
Saving grace:
the most frequently used instructions are not too difficult to build
compilers avoid the portions of the architecture that are slow
what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective
64
Instruction complexity is only one variable
lower instruction count vs. higher CPI / lower clock rate
Design Principles:
simplicity favors regularity
smaller is faster
good design demands compromise
make the common case fast
Instruction set architecture
a very important abstraction indeed!
Summary
65
Computer Arithmetic
66
Arithmetic
Basic computation involves arithmetic operations
Instructions for arithmetic operations
Arithmetic operations implemented in hardware
Arithmetic-Logical Unit (ALU)
[Diagram: ALU with 32-bit inputs a and b, an operation select input, and a 32-bit result output]
67
Bits are just bits (no inherent meaning)
conventions define the relationship between bits and numbers
Binary numbers (base 2):
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...
decimal: 0 ... 2^n - 1
Problems:
With a fixed set of bits you can represent only a finite set of numbers, but the set of possible numbers (even just the integers) is infinite
how to represent fractions and real numbers?
how to represent negative numbers?
Which bit patterns will represent which numbers?
Numbers
68
Sign Magnitude    One's Complement    Two's Complement
000 = +0          000 = +0            000 = +0
001 = +1          001 = +1            001 = +1
010 = +2          010 = +2            010 = +2
011 = +3          011 = +3            011 = +3
100 = -0          100 = -3            100 = -4
101 = -1          101 = -2            101 = -3
110 = -2          110 = -1            110 = -2
111 = -3          111 = -0            111 = -1
Issues: balance, number of zeros, ease of operations
Which one is best? Why?
Possible Representations of Negative Integers
Answer: two's complement; it has a single, consistent zero
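The three conventions in the table can be decoded programmatically; a sketch for n-bit patterns:

```python
def sign_magnitude(bits, n=3):
    mag = bits & ((1 << (n - 1)) - 1)        # low n-1 bits are the magnitude
    return -mag if bits >> (n - 1) else mag  # high bit is the sign

def ones_complement(bits, n=3):
    if bits >> (n - 1) == 0:
        return bits
    return -((~bits) & ((1 << (n - 1)) - 1))  # negative: complement the bits

def twos_complement(bits, n=3):
    return bits - (1 << n) if bits >> (n - 1) else bits

# the "100" row of the table:
print(sign_magnitude(0b100), ones_complement(0b100), twos_complement(0b100))
# prints 0 -3 -4 (sign-magnitude "-0" is numerically 0)
```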
69
32 bit signed numbers (MIPS):
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
...
0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten  (maxint)
1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten  (minint)
1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
...
1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
1111 1111 1111 1111 1111 1111 1111 1111two = -1ten
70
Negating a two's complement number:
invert all bits and add 1
remember: negate and invert are quite different!
Converting n bit numbers into numbers with more than n bits:
the MIPS 16 bit immediate gets converted to 32 bits for arithmetic
copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
"sign extension" (lbu vs. lb)
Two's Complement Operations
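Sign extension as described can be sketched on plain integers, reproducing the 4-bit to 8-bit examples:

```python
def sign_extend(bits, from_n, to_n):
    # copy the most significant (sign) bit into the new upper bits
    if (bits >> (from_n - 1)) & 1:
        bits |= ((1 << (to_n - from_n)) - 1) << from_n
    return bits & ((1 << to_n) - 1)

print(f"{sign_extend(0b0010, 4, 8):08b}")  # 00000010
print(f"{sign_extend(0b1010, 4, 8):08b}")  # 11111010
```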
71
Just like in school:
  0111      0111      0110
+ 0110    - 0110    - 0101
Two's complement operations are easy
subtraction using addition of negative numbers:
  0111
+ 1010
Overflow (result too large for the finite computer word):
e.g., adding two n-bit numbers does not always yield an n-bit number
  0111
+ 0001
  1000  (becomes negative!)
note that the term overflow is somewhat misleading; it does not mean a carry overflowed
Addition & Subtraction
72
No overflow when adding a positive and a negative number
No overflow when the signs are the same for subtraction
Overflow occurs when the value affects the sign:
overflow when adding two positives yields a negative
or, adding two negatives gives a positive
or, subtracting a negative from a positive gives a negative
or, subtracting a positive from a negative gives a positive
Consider the operations A + B and A - B
Can overflow occur if B is 0?
Can overflow occur if A is 0?
Detecting Overflow
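The sign rule above can be written down directly; a sketch for 4-bit two's complement addition:

```python
N = 4  # word width for this sketch

def add_overflows(a_bits, b_bits):
    r_bits = (a_bits + b_bits) & ((1 << N) - 1)    # keep only N bits
    sa, sb, sr = a_bits >> (N - 1), b_bits >> (N - 1), r_bits >> (N - 1)
    # overflow iff the operands share a sign and the result's sign differs
    return sa == sb and sa != sr

print(add_overflows(0b0111, 0b0001))  # True: 7 + 1 does not fit in 4 bits
print(add_overflows(0b0111, 0b1010))  # False: mixed signs never overflow
```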
73
An exception (interrupt) occurs in MIPS
Control jumps to a predefined address for the exception
The interrupted address is saved for possible resumption
Handling is based on the requirements of the software
Don't always want to detect overflow
Unsigned arithmetic: new MIPS instructions addu, addiu, subu
note: addiu still sign-extends!
note: sltu, sltiu for unsigned comparisons
Effects of Overflow
74
Bit-wise AND, OR, Invert
Shift left
Shift right
Additional Operations
Bit-wise XOR, XNOR
Logical Operations
75
76
Let's build an ALU to support the andi and ori instructions
we'll just build a 1 bit ALU, and use 32 of them
Possible implementation (sum-of-products):
[Diagram: 1-bit logic unit with inputs a and b and an operation select producing result; truth table with columns op, a, b, res]
An ALU (arithmetic logic unit)
77
Selects one of the inputs to be the output, based on a control input
Let's build our ALU using a MUX:
[Diagram: multiplexor with data inputs A (select = 0) and B (select = 1), select line S, and output C]
The Multiplexor
note: we call this a 2-input mux even though it has 3 inputs!
78
Desirable features:
Do not want too many inputs to a single gate
Do not want to have to go through too many gates
Let's look at a 1-bit ALU for addition:
How could we build a 1-bit ALU for add, and, and or?
How could we build a 32-bit ALU?
Different Implementations
cout = a·b + a·cin + b·cin
sum = a xor b xor cin
[Diagram: 1-bit full adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
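The sum-of-products equations on this slide (cout = a·b + a·cin + b·cin, sum = a xor b xor cin) translate line-for-line into a 1-bit full adder; a sketch, checked exhaustively against integer addition:

```python
def full_adder(a, b, cin):
    cout = (a & b) | (a & cin) | (b & cin)  # carry out, sum-of-products form
    s = a ^ b ^ cin                         # sum bit
    return s, cout

# exhaustively compare against ordinary addition
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin  # the pair encodes the 2-bit sum
print("full adder agrees with a + b + cin on all 8 cases")
```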
79
Building a 32 bit ALU
[Diagram: a 1-bit ALU (inputs a, b, CarryIn and an Operation select; a multiplexor chooses among AND, OR, and add; outputs Result and CarryOut), replicated 32 times as ALU0..ALU31 with each CarryOut feeding the next stage's CarryIn, producing Result0..Result31]
80
What about subtraction (a - b)?
Two's complement approach: just negate b and add.
How do we negate? A very clever solution:
[Diagram: the 1-bit ALU extended with a Binvert control; a multiplexor selects either b or its complement as the adder input]
81
Need to support the set-on-less-than instruction (slt)
remember: slt is an arithmetic instruction
produces a 1 if rs < rt and 0 otherwise
use subtraction: (a - b) < 0 implies a < b
Need to support test for equality (beq $t5, $t6, $t7)
use subtraction: (a - b) = 0 implies a = b
Adding more Operations to ALU
82
Supporting slt
Can we figure out the idea?
[Diagram: (a) a 1-bit ALU with Binvert, CarryIn, an Operation select, and an extra Less input as multiplexor option 3; (b) the ALU for the most significant bit, which additionally produces a Set output and contains the overflow detection logic]
83
[Diagram: the 32-bit ALU built from ALU0..ALU31, with the Set output of ALU31 fed back to the Less input of ALU0; Binvert, CarryIn, and Operation control lines; Result0..Result31 and an Overflow output]
84
Test for equality
Notice control lines:
000 = and
001 = or
010 = add
110 = subtract
111 = slt
Note: zero is a 1 when the result is zero!
[Diagram: the final 32-bit ALU with a combined Bnegate control, Operation lines, Result0..Result31, a Zero output, Overflow, and the Set output of ALU31 feeding the Less input of ALU0]
85
Recap
We can build an ALU to support the MIPS instruction set
key idea: use a multiplexor to select the output we want
we can efficiently perform subtraction using two's complement
we can replicate a 1-bit ALU to produce a 32-bit ALU
Important points about hardware
all of the gates are always working
the speed of a gate is affected by the number of inputs to the gate
the speed of a circuit is affected by the number of gates in series (on the critical path, or the deepest level of logic)
Note
Clever changes to organization can improve performance (similar to using better algorithms in software)
we'll look at two examples, for addition and multiplication
86
Is a 32-bit ALU as fast as a 1-bit ALU?
There is a sequential dependence between the stages of the 32 bit ALU
Fast carry: compute all carries in parallel
c1 = b0·c0 + a0·c0 + a0·b0
c2 = b1·c1 + a1·c1 + a1·b1  (substitute c1 to expand c2)
c3 = b2·c2 + a2·c2 + a2·b2  (substitute c2 to expand c3)
c4 = b3·c3 + a3·c3 + a3·b3  (substitute c3 to expand c4)
Not feasible! Why?
fully expanding each carry gives a large hardware requirement
Problem: the ripple carry adder is slow
87
An approach in-between our two extremes
Motivation: if we didn't know the value of carry-in, what could we do?
When would we always generate a carry? gi = ai·bi
When would we propagate the carry? pi = ai + bi
ci+1 = gi + pi·ci
When gi is 1: ci+1 = gi + pi·ci = 1 + pi·ci = 1, so the adder generates ci+1 independent of ci
When gi = 0 and pi = 1: ci+1 = 0 + 1·ci = ci, so the adder propagates
Did we get rid of the ripple?
c1 = g0 + p0·c0
c2 = g1 + p1·c1 = g1 + p1·g0 + p1·p0·c0
c3 = g2 + p2·c2 = g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0
Feasible! Why?
Can use generate and propagate for larger building blocks: a 4-bit adder
Carry-Lookahead adder
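The generate/propagate recurrence above (g = a·b, p = a + b, c[i+1] = g[i] + p[i]·c[i]) can be unrolled for a 4-bit slice; a sketch:

```python
def carries(a_bits, b_bits, c0, n=4):
    g = [(a_bits >> i & 1) & (b_bits >> i & 1) for i in range(n)]  # generate
    p = [(a_bits >> i & 1) | (b_bits >> i & 1) for i in range(n)]  # propagate
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))  # c[i+1] = g[i] + p[i]·c[i]
    return c  # carries c0..c4

print(carries(0b1011, 0b0110, 0))
```

For a = 1011 and b = 0110 this yields carries [0, 0, 1, 1, 1]; the carry out of 1 matches 11 + 6 = 17 overflowing 4 bits.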
88
Four 4-bit adders combined to make a 16 bit adder
Carries come from the carry-lookahead unit
The carry-lookahead adder is faster because the carry generation and propagation logic starts working the moment the clock cycle begins;
the carry goes through a smaller number of gates
Typically this 16-bit adder is 6 times faster than a ripple carry adder
Build bigger adders
[Figure: 16-bit adder from four 4-bit ALUs producing Result0-3, Result4-7, Result8-11 and Result12-15; each block supplies Pi and Gi to a carry-lookahead unit that produces the block carries C1-C4]
89
More complicated than addition; accomplished via shifting and addition
More time and more area
Simplest scheme (paper-and-pencil):
    0010  (multiplicand)
  x 1011  (multiplier)
  ------
    0010
   0010
  0000
 0010
 -------
 0010110  (product: 2 x 11 = 22)
Negative numbers: convert to positive, multiply, then convert the result back
Multiplication
90
Multiplication implementation
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to product and place the result in the Product register
   Multiplier0 = 0: skip the add
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): Done
91
Multiplication: Implementation
[Figure: first version hardware: 64-bit Multiplicand register (shift left), 64-bit ALU, 64-bit Product register (write), 32-bit Multiplier register (shift right), control test]
92
2nd Version
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
2. Shift the Product register right 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): Done
93
Second Version
[Figure: second version hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), 32-bit Multiplier register (shift right), control test]
94
Final Version
[Figure: final version hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), control test; the multiplier starts in the right half of the Product register]
Start
1. Test Product0
   Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
2. Shift the Product register right 1 bit
32nd repetition? No: go back to step 1. Yes: Done
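The final-version algorithm is easy to model in software (a sketch using Python integers in place of registers; names are mine):

```python
def multiply(multiplicand, multiplier, n=32):
    """Final-version shift-add multiply: the multiplier starts in the
    low half of the 2n-bit product register; each step tests Product0,
    conditionally adds the multiplicand to the upper half, then shifts
    the whole product register right one bit."""
    product = multiplier & ((1 << n) - 1)   # lower half holds the multiplier
    for _ in range(n):
        if product & 1:                     # step 1: test Product0
            product += multiplicand << n    # step 1a: add to the left half
        product >>= 1                       # step 2: shift right 1 bit
    return product
```

After n iterations the multiplier bits have all been shifted out and the register holds the full 2n-bit product.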
95
Efficient Multiplication: Booth's Multiplication
Motivation
Use of addition and subtraction permits product computation in a variety of ways
E.g. 2 x 6: 0010 x 0110
6 = -2 + 8, i.e. 0110 = -0010 + 1000
2x6 = -(2x2) + (2x8) = -4 + 16 = 12
We can replace a string of 1s in the multiplier with an initial subtract when we see the first 1 and a later add for the bit after the last 1
Goal: reduce the number of additions (subtractions)
96
Booth's Algorithm
Works with signed integers in two's complement form
Looks at two bits at a time, scanning from right to left
Steps
1. Depending on the current and previous bits, do:
00 : Middle of a string of 0s, so no arithmetic operation
01 : End of a string of 1s, so add the multiplicand to the left half of the product
10 : Beginning of a string of 1s, so subtract the multiplicand from the left half of the product
11 : Middle of a string of 1s, so no arithmetic operation
Start with an imaginary 0 to the right of the rightmost bit for the first stage
2. Shift the product register right 1 bit
Simulated Example
97
Booth's Algorithm: Example
   10011100  (-100, multiplicand)
 x 01100011  (  99, multiplier)
 -----------------------------
   00000000 00000000
 - 11111111 10011100   (10 pair: subtract M at position 0)
 = 00000000 01100100
 + 11111110 01110000   (01 pair: add M at position 2)
 = 11111110 11010100
 - 11110011 10000000   (10 pair: subtract M at position 5)
 = 00001011 01010100
 + 11001110 00000000   (01 pair: add M at position 7)
 = 11011001 01010100   (-9900)
Note that the multiplicand and multiplier are 8-bit two's complement numbers, but the result is understood as a 16-bit two's complement number. Be careful about the proper alignment of the columns: a 10 pair causes a subtraction aligned with the 1, and a 01 pair causes an addition aligned with the 0; in both cases the operation aligns with the bit on the left of the pair. The algorithm starts with the 0th bit; we should assume there is a (-1)th bit having value 0.
98
Booth's Algorithm: Hardware
The hardware consists of a 32-bit register M for the multiplicand, a 64-bit product register P, a 1-bit register C, a 32-bit ALU, and control.
Initially, M contains the multiplicand, P contains the multiplier in its lower half (the upper half Ph = 0), and C contains bit 0. The algorithm is the following, repeated 32 times:
1. If the (P0, C) pair is:
10: Ph = Ph - M
01: Ph = Ph + M
00: do nothing
11: do nothing
2. Arithmetic shift P right 1 bit. The shifted-out bit goes into C.
Arithmetic shift preserves the sign of a two's complement number; thus shift right arithmetic (sra):
0100...111 -> 00100...11
1100...111 -> 11100...11
Shift right arithmetic performed on P is equivalent to shifting the multiplicand left with sign extension.
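The hardware description above can be modeled directly (a toy sketch of the P/C registers using masked Python integers; names are mine):

```python
def booth_multiply(m, q, n=8):
    """Booth's algorithm on n-bit two's-complement operands, modeling a
    2n-bit product register P and the 1-bit register C."""
    mask_n = (1 << n) - 1
    mask_2n = (1 << 2 * n) - 1
    M = (m & mask_n) << n              # multiplicand aligned with upper half Ph
    p = q & mask_n                     # lower half holds the multiplier; Ph = 0
    c = 0                              # imaginary bit to the right of bit 0
    for _ in range(n):
        pair = ((p & 1) << 1) | c
        if pair == 0b10:               # beginning of a run of 1s: Ph -= M
            p = (p - M) & mask_2n
        elif pair == 0b01:             # end of a run of 1s: Ph += M
            p = (p + M) & mask_2n
        c = p & 1                      # shifted-out bit goes into C
        sign = p >> (2 * n - 1)
        p = (p >> 1) | (sign << (2 * n - 1))   # arithmetic shift right
    return p - (1 << 2 * n) if p >> (2 * n - 1) else p
```

Because M's low n bits are zero, adding or subtracting M modulo 2^(2n) touches only the upper half Ph, exactly as in the hardware.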
99
Floating Point Numbers
We need a way to represent
numbers with fractions, e.g., 3.1416
very small numbers, e.g., .000000001
very large numbers, e.g., 3.15576 x 10^9
Representation: sign, exponent, significand: (-1)^sign x significand x 2^exponent
more bits for significand gives more accuracy
more bits for exponent increases range
IEEE 754 floating point standard:
single precision: 8-bit exponent, 23-bit significand
double precision: 11-bit exponent, 52-bit significand
100
IEEE 754 floating-point standard
Leading 1 bit of significand is implicit
Exponent is biased to make sorting easier
all 0s is the smallest exponent, all 1s is the largest
bias of 127 for single precision and 1023 for double precision
summary: (-1)^sign x (1 + significand) x 2^(exponent - bias)
Example:
decimal: -.75 = -3/4 = -3/2^2
binary: -11/2^2 = -.11 = -1.1 x 2^-1
floating point: exponent = -1 + 127 = 126 = 01111110
IEEE single precision: 1 01111110 10000000000000000000000
Representation of zero: all zero bits in the exponent is reserved and used for indicating zero.
A pattern of all 1 bits in the exponent indicates values and situations outside the scope of normal representation (infinity, NaN)
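The -0.75 example can be checked mechanically by pulling the bit pattern out of a real single-precision value (a sketch; helper names are mine):

```python
import struct

def float_bits(x):
    """IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

def fields(x):
    """Split the pattern into (sign, biased exponent, significand)."""
    b = float_bits(x)
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF
```

For -0.75 this yields sign 1, biased exponent 126, and a significand whose only set bit is the leading fraction bit (the implicit 1 is not stored).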
101
Floating Point Addition
Align the binary point of the number with the smaller exponent by shifting its significand to the right (so that the smaller exponent matches the larger one)
Add the significands
Normalize the result and adjust the exponent accordingly (shifting right and incrementing the exponent, or shifting left and decrementing the exponent)
Generate an exception in case of underflow or overflow
If necessary, round (or truncate) the significand
102
Floating Point Multiplication
Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent
Multiply the significands
Normalize the product if necessary by shifting right and incrementing the exponent
Round the significand
Set the sign of the product correctly
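The multiply steps can be followed on (sign, exponent, significand) triples; the sketch below is a toy model that uses Python floats for the significand arithmetic, not a bit-exact implementation:

```python
import math

def fp_mul(a, b):
    """Multiply via the steps above: add exponents, multiply
    significands, renormalize, set the sign."""
    def decompose(x):
        frac, exp = math.frexp(abs(x))        # frac is in [0.5, 1)
        return (x < 0), exp - 1, frac * 2.0   # significand in [1, 2)
    sa, ea, fa = decompose(a)
    sb, eb, fb = decompose(b)
    exp = ea + eb                             # add the exponents
    sig = fa * fb                             # multiply the significands
    if sig >= 2.0:                            # normalize: shift right, bump exponent
        sig, exp = sig / 2.0, exp + 1
    sign = -1.0 if sa != sb else 1.0          # set the sign of the product
    return sign * sig * 2.0 ** exp
```

Since both significands lie in [1, 2), their product lies in [1, 4), so at most one normalization shift is ever needed.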
103
Floating Point instructions in MIPS
Addition, subtraction, multiplication, division, comparison; single and double precision
Separate floating-point registers: $f0, $f1, $f2, ... and separate load and store instructions for floating-point registers
Registers used either as single or double precision; a double-precision register is really an even-odd pair of single-precision registers, using the even register's number as its name
104
Accurate Arithmetic?
Floating point numbers, unlike integers, are approximations
Between 0 and 1 there is an infinite number of real numbers, of which only 2^53 can be exactly represented in double-precision form
Rounding provides the mechanism for the desired approximation
Extra bits are required because if every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round
IEEE 754 keeps 2 extra bits on the right during intermediate calculations, called guard and round
A DECIMAL EXAMPLE:
with a 3-significant-digit significand and the 2 extra digits (round and guard):
2.56 x 10^0 + 2.34 x 10^2
Alignment: 2.3400 + 0.0256 (5 in guard, 6 in round)
Result: 2.3656 x 10^2, after rounding 2.37 x 10^2
Without guard or round digits: 2.34 + 0.02 = 2.36 x 10^2
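The decimal example can be replayed with a toy fixed-point model (a sketch under the stated assumptions; the function name and round-half-up choice are mine):

```python
def fp_add_decimal(a_sig, a_exp, b_sig, b_exp, digits=3, extra=2):
    """Toy decimal floating-point add: significands such as 2.56 carry
    `digits` significant digits; `extra` guard digits are kept while
    aligning, then the sum is rounded back to `digits` digits."""
    if a_exp < b_exp:                    # make a the larger-exponent operand
        a_sig, a_exp, b_sig, b_exp = b_sig, b_exp, a_sig, a_exp
    shift = a_exp - b_exp
    scale = 10 ** (digits - 1 + extra)   # fixed-point scale incl. guard digits
    a_fix = round(a_sig * scale)
    b_fix = round(b_sig * scale) // 10 ** shift   # align; digits beyond the
                                                  # guard positions fall off
    total = a_fix + b_fix
    q = 10 ** extra                      # drop the guard digits, rounding half up
    result = (total + q // 2) // q / 10 ** (digits - 1)
    return result, a_exp
```

With two guard digits the 0.0256 survives alignment and the sum rounds to 2.37 x 10^2; with none, it is truncated to 0.02 and the answer is 2.36 x 10^2.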
105
Floating Point Complexities: Summary
Operations are somewhat more complicated
In addition to overflow we can have underflow
Accuracy can be a big problem
IEEE 754 keeps two extra bits, guard and round
four rounding modes
positive divided by zero yields infinity
zero divided by zero yields "not a number" (NaN)
other complexities
Implementing the standard can be tricky
Not using the standard can be even worse
106
107
108
Through implementation of a simplified version of MIPS
Simplified to contain only:
memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j
Generic implementation:
use the program counter (PC) to supply the instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
The Processor: Datapath & Control
109
Conceptual View of the Processor
[Figure: conceptual view of the processor: the PC addresses instruction memory; the fetched instruction supplies register numbers to the register file, whose outputs feed the ALU and data memory, with results written back]
Two types of functional units:
elements that operate on data values (combinational)
elements that contain state (sequential)
110
Unclocked vs. Clocked
Clocks used in synchronous logic
when should an element that contains state be updated?
cycle time
rising edge
falling edge
Recap: State Elements
111
The set-reset latch output depends on present inputs and also on past inputs
An un-clocked state element
112
Output is equal to the stored value inside the element
Change of state (value) is based on the clock
Latches: state changes whenever the inputs change and the clock is asserted
Flip-flop: state changes only on a clock edge (edge-triggered methodology)
"Logically true" could mean electrically low
A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written
Latches and Flip-flops
113
Two inputs:
the data value to be stored (D)
the clock signal (C) indicating when to read and store D
Two outputs:
the value of the internal state (Q) and its complement
D-latch
[Figure: D-latch gate-level schematic and its timing waveform for C, D and Q]
114
D flip-flop
Output changes only on the clock edge
[Figure: D flip-flop built from two D latches in a master-slave arrangement, with its timing waveform for C, D and Q]
115
Our Implementation
An edge-triggered methodology
Typical execution:
read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements
[Figure: state element 1 feeds combinational logic, which feeds state element 2; one transfer per clock cycle]
116
Built using D flip-flops
Register File
[Figure: register file read ports: two read-register numbers select among registers 0 to n-1 through multiplexors to produce Read data 1 and Read data 2; the write port takes Write register, Write data and a Write enable]
117
Register File
Use the real clock to determine when to write
[Figure: register file write port: an n-to-1 decoder of the register number, gated with the Write signal and the clock, drives the C input of each register 0 to n-1; Register data feeds every D input]
118
Building the Datapath
Use multiplexors to stitch functional components together
[Figure: single-cycle datapath: PC and instruction memory, register file, sign-extend and shift-left-2 units, ALU, data memory, two adders for PC+4 and the branch target, and multiplexors controlled by RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, PCSrc and a 3-bit ALU operation]
119
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Decode information that comes from the 32 bits of the instruction
Example: add $8, $17, $18
Instruction format:
000000 10001 10010 01000 00000 100000
  op     rs    rt    rd  shamt funct
ALU's operation is based on instruction type and function code
120
What should the ALU do with this instruction?
Example: lw $1, 100($2)
 35   2   1      100
 op   rs  rt  16-bit offset
ALU control input:
000 AND
001 OR
010 add
110 subtract
111 set-on-less-than
Why is the code for subtract 110 and not 011?
Control
121
Must describe hardware to compute the 3-bit ALU control input given the instruction type:
00 = lw, sw
01 = beq
11 = arithmetic
and, for arithmetic, the function code
Describe it using a truth table (which can be turned into gates):
ALUOp is computed from the instruction type

ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
  0      0    |  X  X  X  X  X  X |    010
  X      1    |  X  X  X  X  X  X |    110
  1      X    |  X  X  0  0  0  0 |    010
  1      X    |  X  X  0  0  1  0 |    110
  1      X    |  X  X  0  1  0  0 |    000
  1      X    |  X  X  0  1  0  1 |    001
  1      X    |  X  X  1  0  1  0 |    111
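The ALU-control truth table can be written directly as a small lookup function (a sketch, not the gate-level implementation; the helper name is mine):

```python
def alu_control(alu_op, funct):
    """Derive the 3-bit ALU control from the 2-bit ALUOp and the 6-bit
    function field (the funct field matters only for R-type)."""
    if alu_op == 0b00:            # lw/sw: address calculation
        return 0b010              # add
    if alu_op & 0b01:             # beq: compare via subtraction
        return 0b110              # subtract
    # ALUOp1 = 1: R-type, decode the low four funct bits
    return {0b0000: 0b010,        # add
            0b0010: 0b110,        # subtract
            0b0100: 0b000,        # AND
            0b0101: 0b001,        # OR
            0b1010: 0b111}[funct & 0b1111]   # set-on-less-than
```

Only the low four bits of the function field are examined, matching the don't-cares on F5 and F4 in the table.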
122
Control
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1
123
[Figure: single-cycle datapath with control: Instruction[31-26] feeds the Control unit, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; Instruction[25-21] and Instruction[20-16] address the register file, Instruction[15-11] is the alternative write-register number, Instruction[15-0] is sign-extended, and Instruction[5-0] feeds the ALU control]
124
Control
Simple combinational logic (truth tables)
[Figure: the ALU control block (inputs ALUOp and F5-F0, outputs Operation2-0) and the main control as simple combinational logic decoding Op5-Op0 into RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1 and ALUOp0 for R-format, lw, sw and beq]
125
All of the logic is combinational
We wait for everything to settle down and for the right thing to be done
the ALU might not produce the right answer right away
we use write signals along with the clock to determine when to write
Cycle time is determined by the length of the longest path
Our Simple Control Structure
We are ignoring some details like setup and hold times
126
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
memory (2 ns), ALU and adders (2 ns), register file access (1 ns)
[Figure: the complete single-cycle datapath with control signals, used for the cycle-time calculation]
127
Analysis
Single-cycle problems:
what if we had a more complicated instruction like floating point?
wasteful of area: repetition of functional units if they are needed more than once in an instruction
One solution:
use a smaller cycle time
have different instructions take different numbers of cycles
a multicycle datapath:
128
Multi-Cycle Data Path
[Figure: high-level multicycle datapath: one Memory for instructions or data, Instruction register, Memory data register, register file, a single ALU, and internal registers A, B and ALUOut]
129
We will be reusing functional units:
the ALU is used to compute addresses and to increment the PC
the Memory is used for both instructions and data
Our control signals will not be determined solely by the instruction
e.g., what should the ALU do for a subtract instruction?
We'll use a finite state machine for control
Multicycle Approach
130
Finite state machines:
a set of states and
a next-state function (determined by current state and the input)
an output function (determined by current state and possibly input)
We'll use a Moore machine (output based only on current state)
Review: finite state machines
[Figure: FSM block diagram: inputs and the current state feed the next-state function; the output function produces the outputs; a clocked state register holds the current state]
131
Break up the instructions into steps, each step taking one cycle
balance the amount of work to be done
restrict each cycle to use only one major functional unit
At the end of a cycle
store values for use in later cycles (easiest thing to do)
introduce additional internal registers
Multicycle Approach
132
Multi-Cycle Path
[Figure: detailed multicycle datapath: PC, single Memory, Instruction register, Memory data register, register file, sign-extend and shift-left-2 units, and one ALU with internal registers A, B and ALUOut; multiplexors select PC or ALUOut as the memory address, PC or A as the first ALU input, and B, 4, the sign-extended immediate or the shifted immediate as the second ALU input]
133
Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation, or Branch Completion
Memory Access or R-type instruction completion
Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Five Execution Steps
134
Use the PC to get the instruction and put it in the Instruction Register.
Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL ("Register-Transfer Language"):
IR = Memory[PC];
PC = PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 1: Instruction Fetch
135
Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
137
Loads and stores access memory:
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;
R-type instructions finish:
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle, on the clock edge
Step 4 (R-type or memory-access)
138
Reg[IR[20-16]]= MDR;
What about all the other instructions?
Write-back step
139
Illustrations
Implementation of instructions in the multicycle design: add, beq, j
140
Summary:

Step name | Action
Instruction fetch (all classes): IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all classes): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
[Table truncated in the source: the execution, memory-access and write-back rows differ for R-type, memory-reference, branch and jump instructions]
141
How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label   # assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
Simple Questions
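With the cycle counts implied by the five execution steps (lw = 5; sw and R-type = 4; beq and j = 3), a straight-line count is easy to automate (a sketch under those assumptions; names are mine):

```python
# Cycles per instruction class in the multicycle design, from the
# five-step breakdown: beq/j finish after 3 steps, R-type and sw after 4,
# lw needs all 5.
CYCLES = {"lw": 5, "sw": 4, "r-type": 4, "beq": 3, "j": 3}

def total_cycles(instructions):
    """Total cycles for a straight-line sequence of instruction classes."""
    return sum(CYCLES[i] for i in instructions)

code = ["lw", "lw", "beq", "r-type", "sw"]   # the fragment above, branch not taken
```

Summing 5 + 5 + 3 + 4 + 4 gives 21 cycles for the fragment.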
142
Value of control signals is dependent upon:
what instruction is being executed
which step is being performed
Use the information we've accumulated to specify a finite state machine
specify the finite state machine graphically, or
use microprogramming
Implementation can be derived from the specification
Implementing the Control
143
How many state bits will we need?
FSM
[Figure: ten-state finite state diagram (states 0-9) for the multicycle control: state 0 instruction fetch (MemRead, IorD=0, IRWrite, ALUSrcA=0, ALUSrcB=01, ALUOp=00, PCWrite, PCSource=00); state 1 instruction decode/register fetch (ALUSrcA=0, ALUSrcB=11, ALUOp=00); then memory address computation for lw/sw (ALUSrcA=1, ALUSrcB=10, ALUOp=00), execution for R-type (ALUSrcA=1, ALUSrcB=00, ALUOp=10), branch completion for beq (ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCWriteCond, PCSource=01) and jump completion (PCWrite, PCSource=10); followed by memory access (MemRead or MemWrite with IorD=1), R-type completion (RegDst=1, RegWrite, MemtoReg=0) and the write-back step (RegDst=0, RegWrite, MemtoReg=1)]
144
Implementation:
Finite State Machine for Control
[Figure: control logic block: inputs are the opcode field Op5-Op0 from the instruction register and the current state S3-S0 from the state register; outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and the next-state bits NS3-NS0]
145
PLA Implementation
[Figure: PLA implementation of the control function: an AND plane over inputs Op5-Op0 and S3-S0 feeds an OR plane producing IorD, IRWrite, MemRead, MemWrite, PCWrite, PCWriteCond, MemtoReg, PCSource1-0, ALUOp1-0, ALUSrcB1-0, ALUSrcA, RegWrite, RegDst and the next-state bits NS3-NS0]
146
ROM = "Read-Only Memory"
values of memory locations are fixed ahead of time
A ROM can be used to implement a truth table
if the address is m bits, we can address 2^m entries in the ROM
our outputs are the n bits of data that the address points to
ROM Implementation

Example with m = 3 address bits and n = 4 data bits:
000 -> 0011
001 -> 1100
010 -> 1100
011 -> 1000
100 -> 0000
101 -> 0001
110 -> 0110
111 -> 0111
147
How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs
ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
Rather wasteful, since for lots of the entries the outputs are the same
i.e., the opcode is often ignored
ROM Implementation
148
Break up the table into two parts:
4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM
Total: 4.3K bits of ROM
PLA is much smaller
can share product terms
only needs entries that produce an active output
can take into account don't-cares
Size is (#inputs x #product-terms) + (#outputs x #product-terms)
For this example, with 17 product terms: (10 x 17) + (20 x 17) = 510 PLA cells
A PLA cell is usually about the size of a ROM cell (slightly bigger)
ROM vs PLA
149
Complex instructions: the "next state" is often current state + 1
Another Implementation Style
[Figure: sequencer-based control unit: a PLA or ROM produces the datapath control outputs; address select logic chooses the next state from an incrementer (state + 1) or from a dispatch on the instruction register's opcode field, under control of AddrCtl]
150
Details
Dispatch ROM 1:
Op = 000000 (R-format) -> 0110
Op = 000010 (jmp)      -> 1001
Op = 000100 (beq)      -> 1000
Op = 100011 (lw)       -> 0010
Op = 101011 (sw)       -> 0010

Dispatch ROM 2:
Op = 100011 (lw) -> 0011
Op = 101011 (sw) -> 0101

State number | Address-control action    | Value of AddrCtl
0            | Use incremented state     | 3
1            | Use dispatch ROM 1        | 1
2            | Use dispatch ROM 2        | 2
3            | Use incremented state     | 3
4            | Replace state number by 0 | 0
5            | Replace state number by 0 | 0
6            | Use incremented state     | 3
7            | Replace state number by 0 | 0
8            | Replace state number by 0 | 0
9            | Replace state number by 0 | 0

[Figure: address select logic: a multiplexor controlled by AddrCtl chooses among the incremented state, dispatch ROM 1, dispatch ROM 2, and state 0]
151
Microprogramming
What are the microinstructions?
[Figure: microprogrammed control unit: a microcode memory produces the datapath control outputs; a microprogram counter with address select logic (an incrementer plus dispatch ROMs on the opcode field, selected by AddrCtl) chooses the next microinstruction]
152
A specification methodology
appropriate if there are hundreds of opcodes, modes, cycles, etc.
signals specified symbolically using microinstructions
Will two implementations of the same architecture have the same microcode?
What would a microassembler do?
Microprogramming

Label    | ALU control | SRC1 | SRC2    | Register control | Memory    | PCWrite control | Sequencing
Fetch    | Add         | PC   | 4       |                  | Read PC   | ALU             | Seq
         | Add         | PC   | Extshft | Read             |           |                 | Dispatch 1
Mem1     | Add         | A    | Extend  |                  |           |                 | Dispatch 2
LW2      |             |      |         |                  | Read ALU  |                 | Seq
         |             |      |         | Write MDR        |           |                 | Fetch
SW2      |             |      |         |                  | Write ALU |                 | Fetch
Rformat1 | Func code   | A    | B       |                  |           |                 | Seq
         |             |      |         | Write ALU        |           |                 | Fetch
BEQ1     | Subt        | A    | B       |                  |           | ALUOut-cond     | Fetch
JUMP1    |             |      |         |                  |           | Jump address    | Fetch
153
Microinstruction format
Field name | Value | Signals active | Comment

ALU control:
  Add       | ALUOp = 00 | Cause the ALU to add.
  Subt      | ALUOp = 01 | Cause the ALU to subtract; this implements the compare for branches.
  Func code | ALUOp = 10 | Use the instruction's function code to determine ALU control.
SRC1:
  PC | ALUSrcA = 0 | Use the PC as the first ALU input.
  A  | ALUSrcA = 1 | Register A is the first ALU input.
SRC2:
  B       | ALUSrcB = 00 | Register B is the second ALU input.
  4       | ALUSrcB = 01 | Use 4 as the second ALU input.
  Extend  | ALUSrcB = 10 | Use the output of the sign extension unit as the second ALU input.
  Extshft | ALUSrcB = 11 | Use the output of the shift-by-two unit as the second ALU input.
Register control:
  Read      | | Read two registers using the rs and rt fields of the IR as the register numbers, putting the data into registers A and B.
  Write ALU | RegWrite, RegDst = 1, MemtoReg = 0 | Write a register using the rd field of the IR as the register number and the contents of ALUOut as the data.
  Write MDR | RegWrite, RegDst = 0, MemtoReg = 1 | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory:
  Read PC   | MemRead, IorD = 0 | Read memory using the PC as address; write the result into the IR (and the MDR).
  Read ALU  | MemRead, IorD = 1 | Read memory using ALUOut as address; write the result into the MDR.
  Write ALU | MemWrite, IorD = 1 | Write memory using ALUOut as address and the contents of B as the data.
PC write control:
  ALU          | PCSource = 00, PCWrite | Write the output of the ALU into the PC.
  ALUOut-cond  | PCSource = 01, PCWriteCond | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
  Jump address | PCSource = 10, PCWrite | Write the PC with the jump address from the instruction.
Sequencing:
  Seq        | AddrCtl = 11 | Choose the next microinstruction sequentially.
  Fetch      | AddrCtl = 00 | Go to the first microinstruction to begin a new instruction.
  Dispatch 1 | AddrCtl = 01 | Dispatch using ROM 1.
  Dispatch 2 | AddrCtl = 10 | Dispatch using ROM 2.
154
No encoding:
1 bit for each datapath operation
faster, requires more memory (logic)
used for the VAX 780: an astonishing 400K of memory!
Lots of encoding:
send the microinstructions through logic to get control signals
uses less memory, slower
Historical context of CISC:
too much logic to put on a single chip with everything else
use a ROM (or even RAM) to hold the microcode
it's easy to add new instructions
Maximally vs. Minimally Encoded
155
Microcode: Trade-offs
Distinction between specification and implementation is sometimes blurred
Specification advantages:
easy to design and write
design architecture and microcode in parallel
Implementation (off-chip ROM) advantages:
easy to change since values are in memory
can emulate other architectures
can make use of internal registers
Implementation disadvantages, SLOWER now that:
control is implemented on the same chip as the processor
ROM is no longer faster than RAM
there is no need to go back and make changes
156
The Big Picture
Initial representation:   finite state diagram | microprogram
Sequencing control:       explicit next-state function | microprogram counter + dispatch ROMs
Logic representation:     logic equations | truth tables
Implementation technique: programmable logic array | read-only memory
157
158
SRAM:
value is stored on a pair of inverting gates
very fast but takes up more space than DRAM (4 to 6 transistors)
DRAM:
value is stored as a charge on a capacitor (must be refreshed)
very small but slower than SRAM (factor of 5 to 10)
Memories: Review
[Figure: SRAM cell as cross-coupled inverters on bit lines A and B; DRAM cell with word line, pass transistor, capacitor and bit line]
159
Users want large and fast memories!
SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte.
DRAM access times are 60-120 ns at a cost of $5 to $10 per MByte.
Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte.
(1997 figures)
Try and give it to them anyway:
build a memory hierarchy
Exploiting Memory Hierarchy
[Figure: memory hierarchy pyramid, Level 1 through Level n: access time increases with distance from the CPU while the size of the memory at each level grows]
160
Locality
A principle that makes having a memory hierarchy a good idea
If an item is referenced,
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon.
Why does code have locality?
Our initial focus: two levels (upper, lower)
block: minimum unit of data
hit: data requested is in the upper level
miss: data requested is not in the upper level
161
Caches, Memory and Processor
[Figure: CPU connected through a cache controller and cache to main memory, with address and data paths at each interface]
162
Two issues:
How do we know if a data item is in the cache?
If it is, how do we find it?
Our first example:
block size is one word of data
"direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be
e.g., lots of items at the lower level share locations in the upper level
Cache
163
Cache operation
Many main memory locations are mapped onto one cache entry.
May have caches for:
instructions;
data;
data + instructions (unified).
Memory access time is no longer deterministic.
164
Terms
Cache hit: required location is in cache.
Cache miss: required location is not in cache.
Working set: set of locations used by a program in a time interval.
165
Types of misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in the working set map to the same cache entry.
166
Memory system performance
h = cache hit rate.
t_cache = cache access time, t_main = main memory access time.
Average memory access time:
t_av = h * t_cache + (1 - h) * t_main
167
Multiple levels of cache
CPU L1 cache L2 cache
168
Multi-level cache access t ime
h1 = L1 cache hit rate.
h2 = rate of accesses that miss in L1 but hit in L2.
Average memory access time:
t_av = h1 * t_L1 + h2 * t_L2 + (1 - h1 - h2) * t_main
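Both the single-level and two-level averages are one-liners to compute (a sketch; function names and the sample numbers are mine, with h2 taken as the fraction of all accesses that miss L1 but hit L2):

```python
def amat(h, t_cache, t_main):
    """Single-level average access time: h*t_cache + (1-h)*t_main."""
    return h * t_cache + (1 - h) * t_main

def amat2(h1, h2, t_l1, t_l2, t_main):
    """Two-level average access time: h1 is the L1 hit rate, h2 the
    fraction of all accesses that miss L1 but hit L2."""
    return h1 * t_l1 + h2 * t_l2 + (1 - h1 - h2) * t_main
```

For example, with a 90% L1 hit rate, an 8% L1-miss/L2-hit rate, and times of 2, 10 and 60 ns, the two-level average is 0.9*2 + 0.08*10 + 0.02*60 = 3.8 ns.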
169
Cache performance benefits
Keep frequently-accessed locations in the fast cache.
Cache retrieves more than one word at a time.
Sequential accesses are faster after first access.
170
Replacement policies
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
Random.
Least-recently used (LRU).
171
Write operations
Write-through: immediately copy the write to main memory.
Write-back: write to main memory only when the location is removed from the cache.
172
Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location can go into one of n sets.
173
Mapping: address is modulo the number of blocks in the cache
Direct Mapped Cache
[Figure: eight-block direct-mapped cache (indices 000-111): memory addresses with the same low three bits map to the same cache block, e.g. 00001, 01001, 10001 and 11001 all map to index 001, while 00101, 01101, 10101 and 11101 map to index 101]
174
Direct Mapped Cache
[Figure: 32-bit address split into 20-bit tag, 10-bit index, and 2-bit byte offset; a 1024-entry cache holds valid, tag, and data fields, and a tag compare produces the hit signal and 32-bit data]
175
Direct-mapped cache
[Figure: direct-mapped cache entry with valid bit, tag, and a multi-byte cache block; the address tag is compared against the stored tag to produce hit and value]
176
Taking advantage of spatial locality
Direct Mapped Cache
[Figure: direct-mapped cache with 4K entries and four-word (128-bit) blocks; 16-bit tag, 12-bit index, 2-bit block offset; a multiplexor selects one of the four 32-bit words]
177
Hits vs. Misses
Read hits: this is what we want!
Read misses:
stall the CPU, fetch block from memory, deliver to cache, restart
Write hits:
can replace data in cache and memory (write-through)
write the data only into the cache (write-back the cache later)
Write misses:
read the entire block into the cache, then write the word
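The write-policy trade-off can be illustrated with a toy one-block cache that counts trips to main memory (an illustrative sketch, not a hardware model):

```python
def memory_writes(write_addrs, policy):
    """Count main-memory writes for a sequence of word writes through a
    one-block toy cache, under 'through' or 'back' policy."""
    cached, dirty, writes = None, False, 0
    for addr in write_addrs:
        if addr != cached:                  # miss: bring in the new block
            if policy == "back" and dirty:
                writes += 1                 # evicted block copied back
            cached, dirty = addr, False
        if policy == "through":
            writes += 1                     # every write goes to memory
        else:
            dirty = True                    # defer until eviction
    if policy == "back" and dirty:
        writes += 1                         # final flush
    return writes

# three writes to the same block, then one to another block
print(memory_writes([0, 0, 0, 1], "through"))  # 4
print(memory_writes([0, 0, 0, 1], "back"))     # 2
```

Write-back wins when the same block is written repeatedly; write-through keeps memory always up to date.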
178
Hardware Issues
Make reading multiple words easier by using banks of memory.
It can get a lot more complicated...
[Figure: three memory organizations: (a) one-word-wide memory; (b) wide memory with a multiplexor at the cache; (c) four-way interleaved memory banks. Each connects CPU, cache, bus, and memory]
179
Performance
Increasing the block size tends to decrease miss rate:
Use split caches because there is more spatial locality in code:
[Figure: miss rate (0%–40%) vs. block size (4–256 bytes) for cache sizes 1 KB to 256 KB]
Program  Block size in words  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                    6.1%                   2.1%            5.4%
gcc      4                    2.0%                   1.7%            1.9%
spice    1                    1.2%                   1.3%            1.2%
spice    4                    0.3%                   0.6%            0.4%
180
Direct-mapped cache locations
Many locations map onto the same cache block.
Conflict misses are easy to generate:
Array a[] uses locations 0, 1, 2, ...
Array b[] uses locations 1024, 1025, 1026, ...
Operation a[i] + b[i] generates conflict misses.
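The a[i] + b[i] conflict can be reproduced with a toy direct-mapped simulator (one word per block, 1024 blocks assumed):

```python
def count_misses(trace, num_blocks=1024):
    """Miss count for a toy direct-mapped cache with one-word blocks."""
    tags = [None] * num_blocks
    misses = 0
    for addr in trace:
        index, tag = addr % num_blocks, addr // num_blocks
        if tags[index] != tag:   # miss: replace the resident block
            tags[index] = tag
            misses += 1
    return misses

# a[i] at address i, b[i] at 1024 + i: the two map to the same index,
# so each access evicts the other array's word
conflict_trace = [x for i in range(8) for x in (i, 1024 + i)]
print(count_misses(conflict_trace))          # 16: every access misses
print(count_misses(list(range(8)) * 2))      # 8: second pass all hits
```

The same 16 accesses would fit easily in the cache; the misses come purely from the mapping, which is why they are called conflict misses.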
181
Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance:
decreasing the miss ratio
decreasing the miss penalty
What happens if we increase block size?
182
Decreasing miss ratio with associativity
Compared to direct mapped, give a series of references that:
results in a lower miss ratio using a 2-way set associative cache
results in a higher miss ratio using a 2-way set associative cache
assuming we use the least recently used replacement strategy
[Figure: cache configurations for eight blocks: direct mapped (8 one-block entries), two-way set associative (4 sets), four-way set associative (2 sets), and eight-way set associative (fully associative)]
183
Set-associative cache
A set of direct-mapped caches:
[Figure: Set 1 through Set n searched in parallel; the matching set supplies hit and data]
184
An implementation
[Figure: four-way set-associative cache with 256 sets; 22-bit tag and 8-bit index; four parallel tag comparators feed a 4-to-1 multiplexor that produces hit and data]
185
Performance
[Figure: miss rate (0%–15%) vs. associativity (one-way to eight-way) for cache sizes 1 KB to 128 KB]
186
Example: direct-mapped vs. set-associative
address data
000 0101
001 1111
010 0000
011 0110
100 1000
101 0001
110 1010
111 0100
187
Direct-mapped cache behavior
After 001 access:
block  tag  data
00     -    -
01     0    1111
10     -    -
11     -    -
After 010 access:
block  tag  data
00     -    -
01     0    1111
10     0    0000
11     -    -
188
Direct-mapped cache behavior, contd.
After 011 access:
block  tag  data
00     -    -
01     0    1111
10     0    0000
11     0    0110
After 100 access:
block  tag  data
00     1    1000
01     0    1111
10     0    0000
11     0    0110
189
Direct-mapped cache behavior, contd.
After 101 access:
block  tag  data
00     1    1000
01     1    0001
10     0    0000
11     0    0110
After 111 access:
block  tag  data
00     1    1000
01     1    0001
10     0    0000
11     1    0100
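The trace can be replayed in a few lines; the final per-block tags match the last table (3-bit addresses, 4 blocks, so the low two bits are the index and the high bit is the tag):

```python
def final_tags(trace, index_bits=2):
    """Replay a trace on a small direct-mapped cache and return the
    final tag stored in each block (None if never filled)."""
    blocks = 1 << index_bits
    tags = [None] * blocks
    for addr in trace:
        tags[addr % blocks] = addr >> index_bits  # index = low bits
    return tags

# accesses from the slides: 001, 010, 011, 100, 101, 111
result = final_tags([0b001, 0b010, 0b011, 0b100, 0b101, 0b111])
print(result)  # tags for blocks 00..11 -> [1, 1, 0, 1]
```

Accesses 101 and 111 evict the earlier 001 and 011, since each pair shares a block index.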
190
2-way set-associative cache behavior
Final state of cache (twice as big as direct-mapped):
set  blk0 tag  blk0 data  blk1 tag  blk1 data
00   1         1000       -         -
01   0         1111       1         0001
10   0         0000       -         -
11   0         0110       1         0100
191
2-way set-associative cache behavior
Final state of cache (same size as direct-mapped):
set  blk0 tag  blk0 data  blk1 tag  blk1 data
0    01        0000       10        1000
1    10        0001       11        0100
192
Decreasing miss penalty with multilevel caches
Add a second level cache:
often primary cache is on the same chip as the processor
use SRAMs to add another cache above primary memory (DRAM)
miss penalty goes down if data is in 2nd level cache
Example:
CPI of 1.0 on a 500 MHz machine with a 5% miss rate, 200ns DRAM access
Adding 2nd level cache with 20ns access time decreases miss rate to 2%
Using multilevel caches:
try and optimize the hit time on the 1st level cache
try and optimize the miss rate on the 2nd level cache
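One common reading of the example's numbers (a 2 ns cycle gives a 100-cycle DRAM penalty and a 10-cycle L2 penalty; the 5% of L1 misses go to L2, and 2% of accesses still reach DRAM):

```python
cycle = 1 / 500e6                       # 500 MHz -> 2 ns cycle time
dram_penalty = round(200e-9 / cycle)    # 200 ns DRAM -> 100 cycles
l2_penalty = round(20e-9 / cycle)       # 20 ns L2   -> 10 cycles

cpi_l1_only = 1.0 + 0.05 * dram_penalty                       # 6.0
cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * dram_penalty   # 3.5

print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)
```

Under this reading, adding the L2 cache improves effective CPI from 6.0 to 3.5, roughly a 1.7× speedup.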
193
Example caches
194
Memory management units
Memory management unit (MMU) translates addresses:
[Figure: CPU issues logical addresses to the memory management unit, which sends physical addresses to main memory]
195
Memory management tasks
Allows programs to move in physical memory during execution.
Allows virtual memory:
memory images kept in secondary storage;
images returned to main memory on demand during execution.
Page fault: request for location not resident in memory.
196
Address translation
Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses.
Two basic schemes:
segmented;
paged.
Segmentation and paging can be combined (x86).
197
Segments and pages
[Figure: memory holding two variable-size segments and two fixed-size pages]
198
Segment address translation
[Figure: segment base address + logical address → physical address; a range check against the segment lower and upper bounds raises a range error]
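A minimal sketch of the base-plus-bound scheme in the figure (the exact point at which real MMUs perform the range check varies by design; the addresses below are illustrative):

```python
def segment_translate(base, lower, upper, logical):
    """Segmented translation: add the segment base to the logical
    address, then range-check against the segment bounds."""
    physical = base + logical
    if not (lower <= physical <= upper):
        raise MemoryError("range error: address outside segment")
    return physical

# a 4 KB segment based at 0x4000
print(hex(segment_translate(0x4000, 0x4000, 0x4FFF, 0x123)))  # 0x4123
```

An access past the segment's upper bound raises the range error shown in the figure.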
199
Page address translation
[Figure: logical address split into page number and offset; the page number indexes the page table to get page i's base, which is concatenated with the offset to form the physical address]
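The split-index-concatenate flow can be written out directly (4 KB pages assumed, and the tiny page table below is illustrative):

```python
def translate(logical, page_table, offset_bits=12):
    """Paged translation: split the logical address into page number
    and offset, look up the frame, concatenate frame and offset."""
    page = logical >> offset_bits
    offset = logical & ((1 << offset_bits) - 1)
    frame = page_table[page]          # missing entry would be a page fault
    return (frame << offset_bits) | offset

page_table = {0: 5, 1: 2}             # page number -> frame number
print(hex(translate(0x1234, page_table)))  # page 1 -> frame 2: 0x2234
```

Because pages are a power of two in size, the "concatenate" step is just a shift and OR; no addition or bounds check is needed, unlike segmentation.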
200
Page table organizations
[Figure: flat vs. tree-structured page table organizations, each ending in page descriptors]
201
Caching address translations
Large translation tables require main memory access.
TLB: cache for address translation.
Typically small.
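A toy direct-mapped TLB in front of a page table (the entry count and mapping scheme are assumptions for illustration):

```python
class TLB:
    """Toy direct-mapped TLB caching page-table entries."""
    def __init__(self, page_table, size=16):
        self.page_table = page_table
        self.size = size
        self.entries = {}                 # slot -> (page, frame)
        self.hits = self.misses = 0

    def lookup(self, page):
        """Return the frame for a page, counting hits and misses."""
        slot = page % self.size
        entry = self.entries.get(slot)
        if entry is not None and entry[0] == page:
            self.hits += 1
            return entry[1]
        self.misses += 1                  # walk the table in main memory
        frame = self.page_table[page]
        self.entries[slot] = (page, frame)
        return frame

tlb = TLB({0: 5, 1: 2})
frames = [tlb.lookup(p) for p in [0, 0, 1, 0]]
print(frames, tlb.hits, tlb.misses)  # [5, 5, 2, 5] 2 2
```

Because programs exhibit locality, even a small TLB catches most translations and avoids the extra memory access of a table walk.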
202
ARM memory management
Memory region types:
section: 1 Mbyte block;
large page: 64 kbytes;
small page: 4 kbytes.
An address is marked as section-mapped or page-mapped.
Two-level translation scheme.
203
ARM address translation
[Figure: virtual address split into 1st index, 2nd index, and offset; the translation table base register combined with the 1st index selects a 1st-level table descriptor, which combined with the 2nd index selects a 2nd-level table descriptor, which is concatenated with the offset to form the physical address]