Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Computer Architecture and Design – ECEN 350

Part 7

[Some slides adapted from M. Irwin, D. Paterson and others]

ALU Implementation

Review: Basic Hardware

c = a . bba

000

010

001

111

b

ac

b

ac

a c

c = a + bba

000

110

101

111

10

01

c = aa

a0

b1

cd

0

1

a

c

b

d

1. AND gate (c = a . b)

2. OR gate (c = a + b)

3. Inverter (c = a)

4. Multiplexor (if d = = 0, c = a; else c = b)

To warm up let's build a logic unit to support the and and or instructions for MIPS (32-bit registers)

We'll just build a 1-bit unit and use 32 of them

Possible implementation using a multiplexor :

A Simple Multi-Function Logic Unit

a

boutput

operationselector

Selects one of the inputs to be the output based on a control input

Implementation with a Multiplexor

b

0

1

Result

Operation

a

.

.

.

32 u

nit

s

Not easy to decide the best way to implement do not want too many inputs to a single gate do not want to have to go through too many gates (=

levels) Let's look at a 1-bit ALU for addition:

Implementations

cout = a.b + a.cin + b.cin

sum = a.b.cin + a.b.cin + a.b.cin + a.b.cin

= a b cin

Sum

CarryIn

CarryOut

a

b

exclusive or (xor)

Building a 1-bit Binary Adder

1 bit Full Adder

A

BS

carry_in

carry_out

S = A xor B xor carry_in

carry_out = AB v Acarry_in v Bcarry_in (majority function)

How can we use it to build a 32-bit adder?

How can we modify it easily to build an adder/subtractor?

A B carry_in carry_out S

0 0 0 0 0

0 0 1 0 1

0 1 0 0 1

0 1 1 1 0

1 0 0 0 1

1 0 1 1 0

1 1 0 1 0

1 1 1 1 1

1-bit Adder Logic

Half-adder with one xor gate

Full-adder from 2 half-adders andan or gate

Half-adder with the xor gate replacedby primitive gates using the equationAB = A.B +A.B

xor

Building a 32-bit ALU

b

0

2

Result

Operation

a

1

CarryIn

CarryOut

Result31a31

b31

Result0

CarryIn

a0

b0

Result1a1

b1

Result2a2

b2

Operation

ALU0

CarryIn

CarryOut

ALU1

CarryIn

CarryOut

ALU2

CarryIn

CarryOut

ALU31

CarryIn

Ripple-Carry Logic for 32-bit ALU

1-bit ALU for AND, OR and add

Multiplexor control line

Two's complement approach: just negate b and add. How do we negate?

recall negation shortcut : invert each bit of b and set CarryIn to least significant bit (ALU0) to 1

What about Subtraction (a – b) ?

0

2

Result

Operation

a

1

CarryIn

CarryOut

0

1

Binvert

b

Overflow Detection and Effects

Overflow: the result is too large to represent in the number of bits allocated

When adding operands with different signs, overflow cannot occur! Overflow occurs when

adding two positives yields a negative or, adding two negatives gives a positive or, subtract a negative from a positive gives a negative or, subtract a positive from a negative gives a positive

On overflow, an exception (interrupt) occurs Control jumps to predefined address for exception Interrupted address (address of instruction causing the

overflow) is saved for possible resumption Don't always want to detect (interrupt on) overflow

Overflow Detection Overflow can be detected by:

Carry into MSB xor Carry out of MSB

Logic for overflow detection therefore can be put in

to ALU31

1

1

1 10

1

0

1

1

0

0 1 1 1

0 0 1 1+

7

3

0

1

– 6

1 1 0 0

1 0 1 1+

–4

– 5

71

0

New MIPS Instructions

Category Instr Op Code Example Meaning

Arithmetic(R & I format)

add unsigned 0 and 21 addu $s1, $s2, $s3 $s1 = $s2 + $s3

sub unsigned 0 and 23 subu $s1, $s2, $s3 $s1 = $s2 - $s3

add imm.unsigned

9 addiu $s1, $s2, 6 $s1 = $s2 + 6

Data Transfer

ld byte unsigned

24 lbu $s1, 25($s2) $s1 = Mem($s2+25)

ld half unsigned 25 lhu $s1, 25($s2) $s1 = Mem($s2+25)

Cond. Branch (I & R format)

set on less than unsigned

0 and 2b sltu $s1, $s2, $s3 if ($s2<$s3) $s1=1 else $s1=0

set on less than imm unsigned

b sltiu $s1, $s2, 6 if ($s2<6) $s1=1else $s1=0

Sign extend – addiu, addiu, slti, sltiu Zero extend – andi, ori, xori Overflow detected – add, addi, sub

Tailoring the ALU to MIPS:Test for Less-than and Equality

Need to support the set-on-less-than instruction e.g., slt $t0, $t3, $t4 remember: slt is an R-type instruction that produces 1 if rs

< rt and 0 otherwise idea is to use subtraction: rs < rt rs – rt < 0. Recall msb

of negative number is 1 two cases after subtraction rs – rt:

if no overflow then rs < rt most significant bit of rs – rt = 1 if overflow then rs < rt most significant bit of rs – rt = 0

why? e.g., 5ten – 6ten = 0101 – 0110 = 0101 + 1010 = 1111 (ok!) -7ten – 6ten = 1001 – 0110 = 1001 + 1010 = 0011

(overflow!)

set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other Less bits are hardwired 0; so Less is the 32-bit result of slt

Supporting slt0

3

Result

Operation

a

1

CarryIn

CarryOut

0

1

Binvert

b 2

Less

0

3

Result

Operation

a

1

CarryIn

0

1

Binvert

b 2

Less

Set

Overflowdetection

Overflow

a.

b.

Seta31

0

ALU0 Result0

CarryIn

a0

Result1a1

0

Result2a2

0

Operation

b31

b0

b1

b2

Result31

Overflow

Binvert

CarryIn

Less

CarryIn

CarryOut

ALU1Less

CarryIn

CarryOut

ALU2Less

CarryIn

CarryOut

ALU31Less

CarryIn

1- bit ALU for the 31 least significant bits

1-bit ALU for the most significant bit

Extra set bit, to be routed to the Less input of the least significant 1-bitALU, is computed from the most significant Result bit and the Overflow bit(it is not the output of the adder as the figure seems to indicate)

Less input ofthe 31 most significant ALUsis always 0

32-bit ALU from 31 copies of ALU at top left and 1 copy of ALU at bottom left in the most significant position

Tailoring the ALU to MIPS:Test for Less-than and Equality

Need to support test for equality e.g., beq $t5, $t6, $t7

use subtraction: rs - rt = 0 rs = rt

do we need to consider overflow?

Supporting Test for Equality

Seta31

0

Result0a0

Result1a1

0

Result2a2

0

Operation

b31

b0

b1

b2

Result31

Overflow

Bnegate

Zero

ALU0Less

CarryIn

CarryOut

ALU1Less

CarryIn

CarryOut

ALU2Less

CarryIn

CarryOut

ALU31Less

CarryIn

ALU ResultZero

Overflow

a

b

ALU operation

CarryOut

ALU control lines

Bneg- Oper- Func-ate ation tion

0 00 and 0 01 or 0 10 add 1 10 sub 1 11 slt

Symbol representing ALU

Output is 1 only if all Result bits are 0

Combine CarryInto least significantALU and Binvert toa single control lineas both are alwayseither 1 or 0

32-bit MIPS ALU

Conclusion

We can build an ALU to support the MIPS instruction set key idea: use multiplexor to select the output we want we can efficiently perform subtraction using two’s

complement we can replicate a 1-bit ALU to produce a 32-bit ALU

Important points about hardware speed of a gate depends number of inputs (fan-in) to the

gate speed of a circuit depends on number of gates in series

(particularly, on the critical path to the deepest level of logic)

Speed of MIPS operations clever changes to organization can improve performance

(similar to using better algorithms in software) we’ll look at examples for addition, multiplication and

division

Building a 32-bit ALU

b

0

2

Result

Operation

a

1

CarryIn

CarryOut

Result31a31

b31

Result0

CarryIn

a0

b0

Result1a1

b1

Result2a2

b2

Operation

ALU0

CarryIn

CarryOut

ALU1

CarryIn

CarryOut

ALU2

CarryIn

CarryOut

ALU31

CarryIn

Ripple-Carry Logic for 32-bit ALU

1-bit ALU for AND, OR and add

Multiplexor control line

Is a 32-bit ALU as fast as a 1-bit ALU? Why? Is there more than one way to do addition? Yes:

one extreme: ripple-carry – carry ripples through 32 ALUs, slow!

other extreme: sum-of-products for each CarryIn bit – super fast!

CarryIn bits:

c1 = b0.c0 + a0.c0 + a0.b0

c2 = b1.c1 + a1.c1 + a1.b1

= a1.a0.b0 + a1.a0.c0 + a1.b0.c0 (substituting for c1) + b1.a0.b0 + b1.a0.c0 + b1.b0.c0 + a1.b1

c3 = b2.c2 + a2.c2 + a2.b2

= … = sum of 15 4-term products… How fast? But not feasible for a 32-bit ALU! Why? Exponential

complexity!!

Problem: Ripple-carry Adder is Slow

Note: ci is CarryIn bit into i th ALU;c0 is the forced CarryIn into theleast significant ALU

An approach between our two extremes Motivation:

if we didn't know the value of a carry-in, what could we do? when would we always generate a carry? (generate) gi = ai . bi

when would we propagate the carry? (propagate) pi = ai + bi

Express (carry-in equations in terms of generate/propagates)c1 = g0 + p0.c0

c2 = g1 + p1.c1 = g1 + p1.g0 + p1.p0.c0

c3 = g2 + p2.c2 = g2 + p2.g1 + p2.p1.g0 + p2.p1.p0.c0

c4 = g3 + p3.c3 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0

+ p3.p2.p1.p0.c0 Feasible for 4-bit adders – with wider adders unacceptable

complexity. solution: build a first level using 4-bit adders, then a second level

on top

Two-level Carry-lookahead Adder: First Level

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder

Propagate signals for each of the four 4-bit adder blocks:

P0 = p3.p2.p1.p0

P1 = p7.p6.p5.p4

P2 = p11.p10.p9.p8

P3 = p15.p14.p13.p12

Generate signals for each of the four 4-bit adder blocks:G0 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0

G1 = g7 + p7.g6 + p7.p6.g5 + p7.p6.p5.g4

G2 = g11 + p11.g10 + p11.p10.g9 + p11.p10.p9.g8

G3 = g15 + p15.g14 + p15.p14.g13 + p15.p14.p13.g12


CarryIn signals for each of the four 4-bit adder blocks (see earlier carry-in equations in terms of generate/propagates):

C1 = G0 + P0.c0

C2 = G1 + P1.G0 + P1.P0.c0

C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.c0

C4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.c0

C a rry In

R e s u lt0 - -3C a rry In

R e s u lt4 - -7C a rry In

R e s u lt8 - -1 1C a rry In

C a rry O u t

R e s u lt1 2 - - 1 5C a rry In

C 1

C 2

C 3

C 4

P 0G 0

P 1G 1

P 2G 2

P 3G 3

a 0b 0a 1b 1a 2b 2a 3b 3

a 4b 4a 5b 5a 6b 6a 7b 7

a 8b 8a 9b 9

a 1 0b 1 0a 1 1b 1 1

a 1 2b 1 2a 1 3b 1 3a 1 4b 1 4a 1 5b 1 5

Lo

gic

to

co

mp

ute

C1,

C2,

C3,

C4

4bAdder0

4bAdder1

4bAdder2

4bAdder3

16-bit carry-lookahead adder from four 4-bitadders and one carry-lookahead unit

Carry-lookaheadLogic

ALU0

ALU1

ALU2

ALU3

a1

a0

a2

a3

b1

b0

b2

b3

s0

s1

s2

s3

Lo

gic

to

co

mp

ute

c

1,

c2

, c

3,

c4

, P

0,

G0

Blow-up of 4-bit adder:(conceptually) consisting of four 1-bit ALUs plus logic to compute all CarryOut bits and one super generate and one super propagate bit.Each 1-bit ALU is exactly as for ripple-carry except c1, c2, c3 for ALUs 1, 2, 3 comes from the extra logic

CarryIn

Carry-lookahead Unit

Two-level carry-lookahead logic steps:1. compute pi’s and gi’s at each 1-bit ALU

2. compute Pi’s and Gi’s at each 4-bit adder unit

3. compute Ci’s in carry-lookahead unit

4. compute ci’s at each 4-bit adder unit

5. compute results (sum bits) at each 1-bit ALU


Multiplication

Standard shift-add method:

Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000

m bits x n bits = m+n bit product Binary makes it easy:

multiplier bit 1 => copy multiplicand (1 x multiplicand) multiplier bit 0 => place 0 (0 x multiplicand)

3 versions of multiply hardware & algorithms:

x

Shift-add Multiplier Version 1

Done

1. TestMultiplier0

1a. Add multiplicand to product andplace the result in Product register

2. Shift the Multiplicand register left 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Algorithm



64-bit ALU

Control test

MultiplierShift right

ProductWrite

MultiplicandShift left

64 bits

64 bits

32 bits

Done

1. TestMultiplier0




32nd repetition?

Start



Yes: 32 repetitions

Multiplicand register, product register, ALU are64-bit wide; multiplier register is 32-bit wide

Algorithm

32-bit multiplicand starts at right half of multiplicand register

Product register is initialized at 0

Shift-add Multiplier Version1

Done

1. TestMultiplier0




32nd repetition?

Start



Yes: 32 repetitions

Itera Step Multiplier Multiplicand Product-tion 0 init 0011 0000 0010 0000 0000 values 1 1a 0011 0000 0010 0000 0010 2 0011 0000 0100 0000 0010 3 0001 0000 0100 0000 0010

2 …

Example: 0010 * 0011:

Algorithm

Observations on Multiply Version 1

1 step per clock cycle nearly 100 clock cycles to multiply two

32-bit numbers Half the bits in the multiplicand register always 0 64-bit adder is wasted 0’s inserted to right as multiplicand is shifted left

least significant bits of product never change once formed

Intuition: instead of shifting multiplicand to left, shift product to right…


Done

1. TestMultiplier0

1a. Add multiplicand to the left half ofthe product and place the result inthe left half of the Product register

2. Shift the Product register right 1 bit


32nd repetition?

Start



Yes: 32 repetitions

Algorithm



MultiplierShift right

Write

32 bits

64 bits

32 bits

Shift right

Multiplicand

32-bit ALU

Product Control test

Done

1. TestMultiplier0




32nd repetition?

Start



Yes: 32 repetitions

Multiplicand register, multiplier register, ALUare 32-bit wide; product register is 64-bit wide;multiplicand adds to left half of product register

Algorithm

Product register is initialized at 0


Done

1. TestMultiplier0




32nd repetition?

Start



Yes: 32 repetitions

Itera Step Multiplier Multiplicand Product-tion 0 init 0011 0010 0000 0000 values 1 1a 0011 0010 0010 0000 2 0011 0010 0001 0000 3 0001 0010 0001 0000

2 …

Example: 0010 * 0011:

Algorithm


Each step the product register wastes space that exactly matches the current size of the multiplier

Intuition: combine multiplier register and product register…


ControltestWrite

32 bits

64 bits

Shift rightProduct

Multiplicand

32-bit ALU

Done

1. TestProduct0



32nd repetition?

Start

Product0 = 0Product0 = 1


Yes: 32 repetitionsNo separate multiplier register; multiplier placed on right side of 64-bit product register

Algorithm

Product register is initialized with multiplier on right


Done

1. TestProduct0



32nd repetition?

Start

Product0 = 0Product0 = 1


Yes: 32 repetitions

Itera Step Multiplicand Product-tion 0 init 0010 0000 0011 values 1 1a 0010 0010 0011 2 0010 0001 0001

2 …

Example: 0010 * 0011:

Algorithm


2 steps per bit because multiplier & product combined What about signed multiplication?

easiest solution is to make both positive and remember whether to negate product when done, i.e., leave out the sign bit, run for 31 steps, then negate if multiplier and multiplicand have opposite signs

Booth’s Algorithm is an elegant way to multiply signed numbers using same hardware – it also often quicker…

MIPS Notes MIPS provides two 32-bit registers Hi and Lo to hold a

64-bit product mult, multu (unsigned) put the product of two 32-bit

register operands into Hi and Lo: overflow is ignored by MIPS but can be detected by programmer by examining contents of Hi

mflo, mfhi moves content of Hi or Lo to a general-purpose register

Pseudo-instructions mul (without overflow), mulo (with overflow), mulou (unsigned with overflow) take three 32-bit register operands, putting the product of two registers into the third

Divide 1001 Quotient

Divisor 1000 1001010 Dividend –1000 10 101 1010 –1000 10 Remainder

standart method: see how big a multiple of the divisor can be subtracted, creating quotient digit at each step

Binary makes it easy first, try 1 * divisor; if too big, 0 * divisor

Dividend = (Quotient * Divisor) + Remainder 3 versions of divide hardware & algorithm:

Divide Version 1

Done

Test Remainder

2a. Shift the Quotient register to the left,setting the new rightmost bit to 1

3. Shift the Divisor register right 1 bit

33rd repetition?

Start

Remainder < 0


Yes: 33 repetitions

2b. Restore the original value by addingthe Divisor register to the Remainder

register and place the sum in theRemainder register. Also shift the

Quotient register to the left, setting thenew least significant bit to 0

1. Subtract the Divisor register from theRemainder register and place the result in the Remainder register

Remainder > 0–

Itera- Step Quotient Divisor Remaindertion 0 init 0000 0010 0000 0000 0111 1 1 0000 0010 0000 1110 0111 2b 0000 0010 0000 0000 0111 3 0000 0001 0000 0000 0111

2 … 3 4 5

Example: 0111 / 0010:

Algorithm

Divide Version 1

64-bit ALU

Controltest

QuotientShift left

RemainderWrite

DivisorShift right

64 bits

64 bits

32 bits

Done

Test Remainder

2a. Shift the Quotient register to the left,setting the new rightmost bit to 1

3. Shift the Divisor register right 1 bit

33rd repetition?

Start

Remainder < 0


Yes: 33 repetitions

2b. Restore the original value by addingthe Divisor register to the Remainder

register and place the sum in theRemainder register. Also shift the

Quotient register to the left, setting thenew least significant bit to 0

1. Subtract the Divisor register from theRemainder register and place the result in the Remainder register

Remainder > 0–

Divisor register, remainder register, ALU are64-bit wide; quotient register is 32-bit wide

Algorithm

32-bit divisor starts at left half of divisor register

Remainder register is initialized with the dividend at right

Why 33?

Quotient register isinitialized to be 0

Divide Version 1

Observations on Divide Version 1

Half the bits in divisor always 0 1/2 of 64-bit adder is wasted 1/2 of divisor register is wasted

Intuition: instead of shifting divisor to right, shift remainder to left…

Step 1 cannot produce a 1 in quotient bit – as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit)

Intuition: switch order to shift first and then subtract – can save 1 iteration…

Divide Version 2

Controltest

QuotientShift left

Write

32 bits

64 bits

32 bits

Shift left

Divisor

32-bit ALU

Remainder

Divisor register, quotient register, ALU are 32-bit wide; remainder register is 64-bit wide


Done. Shift left half of Remainder right 1 bit

Test Remainder

3a. Shift the Remainder register to the left, setting the new rightmost bit to 0.

32nd repetition?

Start

Remainder < 0


Yes: 32 repetitions

3b. Restore the original value by addingthe Divisor register to the left half of theRemainder register and place the sum

in the left half of the Remainder register.Also shift the Remainder register to theleft, setting the new rightmost bit to 0

2. Subtract the Divisor register from theleft half of the Remainder register andplace the result in the left half of the

Remainder register

Remainder 0

1. Shift the Remainder register left 1 bit

–>

Also shift the Quotient register to the leftsetting to the new rightmost bit to 1.

Also shift the Quotient register to the leftsetting to the new rightmost bit to 0.

AlgorithmWhy this correction step?


Each step the remainder register wastes space that exactly matches the current size of the quotient

Intuition: combine quotient register and remainder register…

Divide Version 3


Test Remainder

3a. Shift the Remainder register to the left, setting the new rightmost bit to 1

32nd repetition?

Start

Remainder < 0


Yes: 32 repetitions




Remainder register

Remainder 0


–>

Write

32 bits

64 bits

Shift leftShift right

Remainder

32-bit ALU

Divisor

Controltest

No separate quotient register; quotient is entered on the right side of the 64-bit remainder register

Algorithm


Why this correction step?

Divide Version 3


Test Remainder

3a. Shift the Remainder register to the left, setting the new rightmost bit to 1

32nd repetition?

Start

Remainder < 0


Yes: 32 repetitions




Remainder register

Remainder 0


–>

Example: 0111 / 0010:

Itera- Step Divisor Remaindertion 0 init 0010 0000 0111 1 0010 0000 1110 1 2 0010 1110 1110 3b 0010 0001 1100

2 … 3 4

Algorithm


Same hardware as Multiply Version 3

Signed divide: make both divisor and dividend positive and perform

division negate the quotient if divisor and dividend were of

opposite signs make the sign of the remainder match that of the dividend this ensures always

dividend = (quotient * divisor) + remainder –quotient (x/y) = quotient (–x/y) (e.g. 7 = 3*2 + 1 & –7 = –3*2

– 1)

Divide generates the reminder in hi and the quotient in lo

div $s0, $s1 # lo = $s0 / $s1

# hi = $s0 mod $s1

Instructions mfhi rd and mflo rd are provided to move the quotient and reminder to (user accessible) registers in the register file

MIPS Divide Instruction

As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0.

op rs rt rd shamt funct

MIPS Notes div rs, rt (signed), divu rs, rt (unsigned), with

two 32-bit register operands, divide the contents of the operands and put remainder in Hi register and quotient in Lo; overflow is ignored in both cases

pseudo-instructions div rdest, rsrc1, src2 (signed with overflow), divu rdest, rsrc1, src2 (unsigned without overflow) with three 32-bit register operands puts quotients of two registers into third

Floating Point

Review of Numbers• Computers are made to deal with numbers

• What can we represent in N bits?• Unsigned integers:

0 to 2N - 1• Signed Integers (Two’s Complement)

-2(N-1) to 2(N-1) - 1

Other Numbers• What about other numbers?

• Very large numbers? (seconds/century)3,155,760,00010 (3.1557610 x 109)

• Very small numbers? (atomic diameter)0.0000000110 (1.010 x 10-8)

• Rationals (repeating pattern) 2/3 (0.666666666. . .)

• Irrationals21/2 (1.414213562373. . .)

• Transcendentalse (2.718...), (3.141...)

• All represented in scientific notation

Scientific Notation (in Decimal)

6.0210 x 1023

radix (base)decimal point

mantissa exponent

• Normalized form: no leadings 0s (exactly one digit to left of decimal point)

• Alternatives to representing 1/1,000,000,000• Normalized: 1.0 x 10-9

• Not normalized: 0.1 x 10-8,10.0 x 10-10

Scientific Notation (in Binary)

1.0two x 2-1

radix (base)“binary point”

exponent

• Computer arithmetic that supports it called floating point, because it represents numbers where the binary point is not fixed, as it is for integers

• Declare such variable in C as float

mantissa

Floating Point Representation (1/2)• Normal format: +1.xxxxxxxxxxtwo*2yyyytwo

• Multiple of Word Size (32 bits)

031S Exponent30 23 22

Significand

1 bit 8 bits 23 bits

• S represents SignExponent represents y’sSignificand represents x’s

• Represent numbers as small as 2.0 x 10-38 to as large as 2.0 x 1038

Floating Point Representation (2/2)• What if result too large? (> 2.0x1038 )

• Overflow!• Overflow Exponent larger than represented in 8-bit Exponent field

• What if result too small? (>0, < 2.0x10-38 )• Underflow!• Underflow Negative exponent larger than represented in 8-bit Exponent field

• How to reduce chances of overflow or underflow?

Double Precision Fl. Pt. Representation• Next Multiple of Word Size (64 bits)

• Double Precision (vs. Single Precision)• C variable declared as double• Represent numbers almost as small as 2.0 x 10-308 to almost as large as 2.0 x 10308

• But primary advantage is greater accuracy due to larger significand

031S Exponent

30 20 19Significand

1 bit 11 bits 20 bitsSignificand (cont’d)

32 bits

IEEE 754 Floating Point Standard (1/4)• Single Precision, DP similar

• Sign bit: 1 means negative0 means positive

• Significand:• To pack more bits, leading 1 implicit for normalized numbers

• 1 + 23 bits single, 1 + 52 bits double• always true: Significand < 1 (for normalized numbers)

• Note: 0 has no leading 1, so reserve exponent value 0 just for number 0

IEEE 754 Floating Point Standard (2/4)• Want FP numbers to be used even if no

FP hardware; e.g., sort records with FP numbers using integer compares

• Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands

• Then want order:• Highest order bit is sign ( negative < positive)• Exponent next, so big exponent => bigger #• Significand last: exponents same => bigger #

IEEE 754 Floating Point Standard (3/4)• Negative Exponent?

• 2’s comp? 1.0 x 2-1 v. 1.0 x2+1 (1/2 v. 2)

0 1111 1111 000 0000 0000 0000 0000 00001/20 0000 0001 000 0000 0000 0000 0000 00002• This notation using integer compare of 1/2 v. 2 makes 1/2 > 2!

• Instead, pick notation 0000 0001 is most negative, and 1111 1111 is most positive• 1.0 x 2-1 v. 1.0 x2+1 (1/2 v. 2)

1/2 0 0111 1110 000 0000 0000 0000 0000 0000

0 1000 0000 000 0000 0000 0000 0000 00002

IEEE 754 Floating Point Standard (4/4)• Called Biased Notation, where bias is number subtracted to get real number• IEEE 754 uses bias of 127 for single prec.• Subtract 127 from Exponent field to get actual value for exponent

• 1023 is bias for double precision

• Summary (single precision):031

S Exponent30 23 22

Significand

1 bit 8 bits 23 bits• (-1)S x (1 + Significand) x 2(Exponent-127)

• Double precision identical, except with exponent bias of 1023

• Floating Point numbers approximate values that we want to use.

• IEEE 754 Floating Point Standard is most widely accepted attempt to standardize interpretation of such numbers• Every desktop or server computer sold since ~1997 follows these conventions

• Summary (single precision):031

S Exponent30 23 22

Significand

1 bit 8 bits 23 bits• (-1)S x (1 + Significand) x 2(Exponent-127)

• Double precision identical, bias of 1023

Example: Converting Binary FP to Decimal

• Sign: 0 => positive

• Exponent: • 0110 1000two = 104ten

• Bias adjustment: 104 - 127 = -23

• Significand:• 1 + 1x2-1+ 0x2-2 + 1x2-3 + 0x2-4 + 1x2-5 +...=1+2-1+2-3 +2-5 +2-7 +2-9 +2-14 +2-15 +2-17 +2-22

= 1.0ten + 0.666115ten

0 0110 1000 101 0101 0100 0011 0100 0010

• Represents: 1.666115ten*2-23 ~ 1.986*10-7

(about 2/10,000,000)

Converting Decimal to FP (1/3)• Simple Case: If denominator is an exponent of 2 (2, 4, 8, 16, etc.), then it’s easy.

• Show MIPS representation of -0.75• -0.75 = -3/4• -11two/100two = -0.11two

• Normalized to -1.1two x 2-1

• (-1)S x (1 + Significand) x 2(Exponent-127)

• (-1)1 x (1 + .100 0000 ... 0000) x 2(126-127)

1 0111 1110 100 0000 0000 0000 0000 0000

Converting Decimal to FP (2/3)• Not So Simple Case: If denominator is not an exponent of 2.

• Then we can’t represent number precisely, but that’s why we have so many bits in significand: for precision

• Once we have significand, normalizing a number to get the exponent is easy.

• So how do we get the significand of a neverending number?

Converting Decimal to FP (3/3)• Fact: All rational numbers have a repeating pattern when written out in decimal.

• Fact: This still applies in binary.

• To finish conversion:• Write out binary number with repeating pattern.

• Cut it off after correct number of bits (different for single v. double precision).

• Derive Sign, Exponent and Significand fields.

Example: Representing 1/3 in MIPS• 1/3

= 0.33333…10

= 0.25 + 0.0625 + 0.015625 + 0.00390625 + …

= 1/4 + 1/16 + 1/64 + 1/256 + …

= 2-2 + 2-4 + 2-6 + 2-8 + …

= 0.0101010101… 2 * 20

= 1.0101010101… 2 * 2-2

• Sign: 0• Exponent = -2 + 127 = 125 = 01111101• Significand = 0101010101…

0 0111 1101 0101 0101 0101 0101 0101 010

Representation for ± ∞• In FP, divide by 0 should produce ± ∞, not overflow.

• Why?• OK to do further computations with ∞ E.g., X/0 > Y may be a valid comparison

• IEEE 754 represents ± ∞• Most positive exponent reserved for ∞• Significands all zeroes

Representation for 0• Represent 0?

• exponent all zeroes• significand all zeroes too• What about sign?• +0: 0 00000000 00000000000000000000000• -0: 1 00000000 00000000000000000000000

• Why two zeroes?• Helps in some limit comparisons

Special Numbers

• What have we defined so far? (Single Precision)

Exponent Significand Object

0 0 0

0 nonzero ???

1-254 anything +/- fl. pt. #

255 0 +/- ∞

255 nonzero ???

Representation for Not a Number• What is sqrt(-4.0)or 0/0?

• If ∞ not an error, these shouldn’t be either.• Called Not a Number (NaN)• Exponent = 255, Significand nonzero

• Why is this useful?• Hope NaNs help with debugging?• They contaminate: op(NaN, X) = NaN

Floating Point Addition Algorithm:

Done

2. Add the significands

4. Round the significand to the appropriatenumber of bits

Still normalized?

Start

Yes

No

No

YesOverflow orunderflow?

Exception

3. Normalize the sum, either shifting right andincrementing the exponent or shifting left

and decrementing the exponent

1. Compare the exponents of the two numbers.Shift the smaller number to the right until itsexponent would match the larger exponent

Floating Point Addition Hardware:

0 10 1 0 1

Control

Small ALU

Big ALU

Sign Exponent Significand Sign Exponent Significand

Exponentdifference

Shift right

Shift left or right

Rounding hardware

Sign Exponent Significand

Increment ordecrement

0 10 1

Shift smallernumber right

Compareexponents

Add

Normalize

Round

Floating Point Multpication Algorithm:

2. Multiply the significands

4. Round the significand to the appropriatenumber of bits

Still normalized?

Start

Yes

No

No

YesOverflow orunderflow?

Exception

3. Normalize the product if necessary, shiftingit right and incrementing the exponent

1. Add the biased exponents of the twonumbers, subtracting the bias from the sum

to get the new biased exponent

Done

5. Set the sign of the product to positive if thesigns of the original operands are the same;

if they differ make the sign negative

MIPS Floating Point Architecture (1/4)

• Separate floating point instructions:• Single Precision:

add.s, sub.s, mul.s, div.s

• Double Precision:add.d, sub.d, mul.d,

div.d

• These are far more complicated than their integer counterparts

• Can take much longer to execute


• Problems:• Inefficient to have different instructions take vastly differing amounts of time.

• Generally, a particular piece of data will not change FP int within a program.

- Only 1 type of instruction will be used on it.

• Some programs do no FP calculations• It takes lots of hardware relative to integers to do FP fast


• 1990 Solution: Make a completely separate chip that handles only FP.

• Coprocessor 1: FP chip• contains 32 32-bit registers: $f0, $f1, …• most of the registers specified in .s and .d instruction refer to this set

• separate load and store: lwc1 and swc1(“load word coprocessor 1”, “store …”)

• Double Precision: by convention, even/odd pair contain one DP FP number: $f0/$f1, $f2/$f3, … , $f30/$f31

- Even register is the name


• Today, FP coprocessor integrated with CPU, or cheap chips may leave out FP HW

• Instructions to move data between main processor and coprocessors:

• mfc0, mtc0, mfc1, mtc1, etc.

• Appendix contains many more FP ops

Performance evaluation

Performance Metrics Purchasing perspective

given a collection of machines, which has the - best performance ?- least cost ?- best cost/performance?

Design perspective faced with design options, which has the

- best performance improvement ?- least cost ?- best cost/performance?

Both require basis for comparison metric for evaluation

Two Notions of “Performance”

Plane

Boeing 747

BAD/Sud Concorde

TopSpeedDC to Paris Passengers Throughput (pass

x mph)

610 mph6.5 hours 470 286,700

1350 mph3 hours 132 178,200

• Which has higher performance?• Time to deliver 1 passenger?• Time to deliver 400 passengers?

• In a computer, time for 1 job calledResponse Time or Execution Time

• In a computer, jobs per day calledThroughput or Bandwidth

• We will focus primarily on execution time for a single job

Performance Metrics

Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task

Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Additional Metrics Smallest/lightest Longest battery life Most reliable/durable (in space)

What is Time?

Straightforward definition of time: Total time to complete a task, including disk accesses, memory

accesses, I/O activities, operating system overhead, ... “real time”, “response time” or “elapsed time”

Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time)

“CPU execution time” or “CPU time” Often divided into system CPU time (in OS) and user CPU time (in

user program)

Performance Factors Want to distinguish elapsed time and the time spent on

our task

CPU execution time (CPU time) – time the CPU spends working on a task

Does not include time waiting for I/O or running other programs

CPU execution time # CPU clock cycles for a program for a program = x clock cycle

time

CPU execution time # CPU clock cycles for a program for a program clock rate = -------------------------------------------

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Many techniques that decrease the number of clock cycles also increase the clock cycle time

or

Review: Machine Clock Rate

Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)

CC = 1 / CR

one clock period

10 nsec clock cycle => 100 MHz clock rate



1 nsec clock cycle => 1 GHz clock rate

500 psec clock cycle => 2 GHz clock rate



Clock Cycles per Instruction Not all instructions take the same amount of time to

execute One way to think about execution time is that it equals the

number of instructions executed multiplied by the average time per instruction

Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute

A way to compare two different implementations of the same ISA

# CPU clock cycles # Instructions Average clock cycles for a program for a program per instruction = x

CPI for this instruction class

A B C

CPI 1 2 3

Effective CPI

Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

Overall effective CPI = (CPIi x ICi)i = 1

n

Where ICi is the count (percentage) of the number of instructions of class i executed

CPIi is the (average) number of clock cycles per instruction for that instruction class

n is the number of instruction classes

The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation Our basic performance equation is then

CPU time = Instruction_count x CPI x clock_cycle

Instruction_count x CPI

clock_rate CPU time = -----------------------------------------------

or

These equations separate the three key factors that affect performance

Can measure the CPU execution time by running the program

The clock rate is usually given Can measure overall instruction count by using profilers/

simulators without knowing all of the implementation details

CPI varies by instruction type and ISA implementation for which we must know the implementation detailsinstruction count != the number of lines in the code

Determinates of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

Instruction_count

CPI clock_cycle

Algorithm

Programming language

Compiler

ISA

Processor organization

TechnologyX

XX

XX

X X

X

X

X

X

X

A Simple Example

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

How does this compare with using branch prediction to shave a cycle off the branch time?

Op Freq CPIi Freq x CPIi

ALU 50% 1

Load 20% 5

Store 10% 3

Branch 20% 2

=

.5

1.0

.3

.4

2.2

CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster

1.6

.5

.4

.3

.4

.5

1.0

.3

.2

2.0

CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

Comparing and Summarizing Performance

Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

How do we summarize the performance for benchmark set with a single number?

The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

AM = 1/n Timeii = 1

n

Where Timei is the execution time for the ith program of a total of n programs in the workload

A smaller mean indicates a smaller average execution time and thus improved performance

SPEC Benchmarks www.spec.org

Integer benchmarks FP benchmarks

gzip compression wupwise Quantum chromodynamics

vpr FPGA place & route swim Shallow water model

gcc GNU C compiler mgrid Multigrid solver in 3D fields

mcf Combinatorial optimization applu Parabolic/elliptic pde

crafty Chess program mesa 3D graphics library

parser Word processing program galgel Computational fluid dynamics

eon Computer visualization art Image recognition (NN)

perlbmk perl application equake Seismic wave propagation simulation

gap Group theory interpreter facerec Facial image recognition

vortex Object oriented database ammp Computational chemistry

bzip2 compression lucas Primality testing

twolf Circuit place & route fma3d Crash simulation fem

sixtrack Nuclear physics accel

apsi Pollutant distribution

http://www.spec.org/

Example SPEC Ratings

Other Performance Metrics Power consumption – especially in the embedded market

where battery life is important (and passive cooling) For power-limited applications, the most important metric is

energy efficiency

Summary: Evaluating ISAs Design-time metrics:

Can it be implemented, in how long, at what cost? Can it be programmed? Ease of compilation?

Static Metrics: How many bytes does the program occupy in memory?

Dynamic Metrics: How many instructions are executed? How many bytes does the

processor fetch to execute the program? How many clocks are required per instruction?

Best Metric: Time to execute the program! CPI

Inst. Count Cycle Timedepends on the instructions set, the processor organization, and compilation techniques.

Example A certain computer has a CPU running at 500 MHz. Your PINE session

on this machine ends up executing 200 million instructions, of the following mix:

Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

What is the average CPI for this session?

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session


Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

What is the average CPI for this session? Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3.



Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

How much CPU time was used in this session?



Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

How much CPU time was used in this session? Answer: (200 x 106 instructions) · (2 ns/cycle) · (2.3 cycles/instruction) =

0.92 seconds.

Documents

Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]