100
Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Embed Size (px)

Citation preview

Page 1: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Computer Architecture and Design – ECEN 350

Part 7

[Some slides adapted from M. Irwin, D. Paterson and others]

Page 2: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

ALU Implementation

Page 3: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Review: Basic Hardware

c = a . bba

000

010

001

111

b

ac

b

ac

a c

c = a + bba

000

110

101

111

10

01

c = aa

a0

b1

cd

0

1

a

c

b

d

1. AND gate (c = a . b)

2. OR gate (c = a + b)

3. Inverter (c = a)

4. Multiplexor (if d = = 0, c = a; else c = b)

Page 4: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

To warm up let's build a logic unit to support the and and or instructions for MIPS (32-bit registers)

We'll just build a 1-bit unit and use 32 of them

Possible implementation using a multiplexor :

A Simple Multi-Function Logic Unit

a

boutput

operationselector

Page 5: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Selects one of the inputs to be the output based on a control input

Implementation with a Multiplexor

b

0

1

Result

Operation

a

.

.

.

32 u

nit

s

Page 6: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Not easy to decide the best way to implement do not want too many inputs to a single gate do not want to have to go through too many gates (=

levels) Let's look at a 1-bit ALU for addition:

Implementations

cout = a.b + a.cin + b.cin

sum = a.b.cin + a.b.cin + a.b.cin + a.b.cin

= a b cin

Sum

CarryIn

CarryOut

a

b

exclusive or (xor)

Page 7: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Building a 1-bit Binary Adder

1 bit Full Adder

A

BS

carry_in

carry_out

S = A xor B xor carry_in

carry_out = AB v Acarry_in v Bcarry_in (majority function)

How can we use it to build a 32-bit adder?

How can we modify it easily to build an adder/subtractor?

A B carry_in carry_out S

0 0 0 0 0

0 0 1 0 1

0 1 0 0 1

0 1 1 1 0

1 0 0 0 1

1 0 1 1 0

1 1 0 1 0

1 1 1 1 1

Page 8: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

1-bit Adder Logic

Half-adder with one xor gate

Full-adder from 2 half-adders andan or gate

Half-adder with the xor gate replacedby primitive gates using the equationAB = A.B +A.B

xor

Page 9: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Building a 32-bit ALU

b

0

2

Result

Operation

a

1

CarryIn

CarryOut

Result31a31

b31

Result0

CarryIn

a0

b0

Result1a1

b1

Result2a2

b2

Operation

ALU0

CarryIn

CarryOut

ALU1

CarryIn

CarryOut

ALU2

CarryIn

CarryOut

ALU31

CarryIn

Ripple-Carry Logic for 32-bit ALU

1-bit ALU for AND, OR and add

Multiplexor control line

Page 10: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Two's complement approach: just negate b and add. How do we negate?

recall negation shortcut : invert each bit of b and set CarryIn to least significant bit (ALU0) to 1

What about Subtraction (a – b) ?

0

2

Result

Operation

a

1

CarryIn

CarryOut

0

1

Binvert

b

Page 11: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Overflow Detection and Effects

Overflow: the result is too large to represent in the number of bits allocated

When adding operands with different signs, overflow cannot occur! Overflow occurs when

adding two positives yields a negative or, adding two negatives gives a positive or, subtract a negative from a positive gives a negative or, subtract a positive from a negative gives a positive

On overflow, an exception (interrupt) occurs Control jumps to predefined address for exception Interrupted address (address of instruction causing the

overflow) is saved for possible resumption Don't always want to detect (interrupt on) overflow

Page 12: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Overflow Detection Overflow can be detected by:

Carry into MSB xor Carry out of MSB

Logic for overflow detection therefore can be put in

to ALU31

1

1

1 10

1

0

1

1

0

0 1 1 1

0 0 1 1+

7

3

0

1

– 6

1 1 0 0

1 0 1 1+

–4

– 5

71

0

Page 13: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

New MIPS Instructions

Category Instr Op Code Example Meaning

Arithmetic(R & I format)

add unsigned 0 and 21 addu $s1, $s2, $s3 $s1 = $s2 + $s3

sub unsigned 0 and 23 subu $s1, $s2, $s3 $s1 = $s2 - $s3

add imm.unsigned

9 addiu $s1, $s2, 6 $s1 = $s2 + 6

Data Transfer

ld byte unsigned

24 lbu $s1, 25($s2) $s1 = Mem($s2+25)

ld half unsigned 25 lhu $s1, 25($s2) $s1 = Mem($s2+25)

Cond. Branch (I & R format)

set on less than unsigned

0 and 2b sltu $s1, $s2, $s3 if ($s2<$s3) $s1=1 else $s1=0

set on less than imm unsigned

b sltiu $s1, $s2, 6 if ($s2<6) $s1=1else $s1=0

Sign extend – addiu, addiu, slti, sltiu Zero extend – andi, ori, xori Overflow detected – add, addi, sub

Page 14: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Tailoring the ALU to MIPS:Test for Less-than and Equality

Need to support the set-on-less-than instruction e.g., slt $t0, $t3, $t4 remember: slt is an R-type instruction that produces 1 if rs

< rt and 0 otherwise idea is to use subtraction: rs < rt rs – rt < 0. Recall msb

of negative number is 1 two cases after subtraction rs – rt:

if no overflow then rs < rt most significant bit of rs – rt = 1 if overflow then rs < rt most significant bit of rs – rt = 0

why? e.g., 5ten – 6ten = 0101 – 0110 = 0101 + 1010 = 1111 (ok!) -7ten – 6ten = 1001 – 0110 = 1001 + 1010 = 0011

(overflow!)

set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other Less bits are hardwired 0; so Less is the 32-bit result of slt

Page 15: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Supporting slt0

3

Result

Operation

a

1

CarryIn

CarryOut

0

1

Binvert

b 2

Less

0

3

Result

Operation

a

1

CarryIn

0

1

Binvert

b 2

Less

Set

Overflowdetection

Overflow

a.

b.

Seta31

0

ALU0 Result0

CarryIn

a0

Result1a1

0

Result2a2

0

Operation

b31

b0

b1

b2

Result31

Overflow

Binvert

CarryIn

Less

CarryIn

CarryOut

ALU1Less

CarryIn

CarryOut

ALU2Less

CarryIn

CarryOut

ALU31Less

CarryIn

1- bit ALU for the 31 least significant bits

1-bit ALU for the most significant bit

Extra set bit, to be routed to the Less input of the least significant 1-bitALU, is computed from the most significant Result bit and the Overflow bit(it is not the output of the adder as the figure seems to indicate)

Less input ofthe 31 most significant ALUsis always 0

32-bit ALU from 31 copies of ALU at top left and 1 copy of ALU at bottom left in the most significant position

Page 16: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Tailoring the ALU to MIPS:Test for Less-than and Equality

Need to support test for equality e.g., beq $t5, $t6, $t7

use subtraction: rs - rt = 0 rs = rt

do we need to consider overflow?

Page 17: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Supporting Test for Equality

Seta31

0

Result0a0

Result1a1

0

Result2a2

0

Operation

b31

b0

b1

b2

Result31

Overflow

Bnegate

Zero

ALU0Less

CarryIn

CarryOut

ALU1Less

CarryIn

CarryOut

ALU2Less

CarryIn

CarryOut

ALU31Less

CarryIn

ALU ResultZero

Overflow

a

b

ALU operation

CarryOut

ALU control lines

Bneg- Oper- Func-ate ation tion

0 00 and 0 01 or 0 10 add 1 10 sub 1 11 slt

Symbol representing ALU

Output is 1 only if all Result bits are 0

Combine CarryInto least significantALU and Binvert toa single control lineas both are alwayseither 1 or 0

32-bit MIPS ALU

Page 18: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Conclusion

We can build an ALU to support the MIPS instruction set key idea: use multiplexor to select the output we want we can efficiently perform subtraction using two’s

complement we can replicate a 1-bit ALU to produce a 32-bit ALU

Important points about hardware speed of a gate depends number of inputs (fan-in) to the

gate speed of a circuit depends on number of gates in series

(particularly, on the critical path to the deepest level of logic)

Speed of MIPS operations clever changes to organization can improve performance

(similar to using better algorithms in software) we’ll look at examples for addition, multiplication and

division

Page 19: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Building a 32-bit ALU

b

0

2

Result

Operation

a

1

CarryIn

CarryOut

Result31a31

b31

Result0

CarryIn

a0

b0

Result1a1

b1

Result2a2

b2

Operation

ALU0

CarryIn

CarryOut

ALU1

CarryIn

CarryOut

ALU2

CarryIn

CarryOut

ALU31

CarryIn

Ripple-Carry Logic for 32-bit ALU

1-bit ALU for AND, OR and add

Multiplexor control line

Page 20: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Is a 32-bit ALU as fast as a 1-bit ALU? Why? Is there more than one way to do addition? Yes:

one extreme: ripple-carry – carry ripples through 32 ALUs, slow!

other extreme: sum-of-products for each CarryIn bit – super fast!

CarryIn bits:

c1 = b0.c0 + a0.c0 + a0.b0

c2 = b1.c1 + a1.c1 + a1.b1

= a1.a0.b0 + a1.a0.c0 + a1.b0.c0 (substituting for c1) + b1.a0.b0 + b1.a0.c0 + b1.b0.c0 + a1.b1

c3 = b2.c2 + a2.c2 + a2.b2

= … = sum of 15 4-term products… How fast? But not feasible for a 32-bit ALU! Why? Exponential

complexity!!

Problem: Ripple-carry Adder is Slow

Note: ci is CarryIn bit into i th ALU;c0 is the forced CarryIn into theleast significant ALU

Page 21: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

An approach between our two extremes Motivation:

if we didn't know the value of a carry-in, what could we do? when would we always generate a carry? (generate) gi = ai . bi

when would we propagate the carry? (propagate) pi = ai + bi

Express (carry-in equations in terms of generate/propagates)c1 = g0 + p0.c0

c2 = g1 + p1.c1 = g1 + p1.g0 + p1.p0.c0

c3 = g2 + p2.c2 = g2 + p2.g1 + p2.p1.g0 + p2.p1.p0.c0

c4 = g3 + p3.c3 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0

+ p3.p2.p1.p0.c0 Feasible for 4-bit adders – with wider adders unacceptable

complexity. solution: build a first level using 4-bit adders, then a second level

on top

Two-level Carry-lookahead Adder: First Level

Page 22: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder

Propagate signals for each of the four 4-bit adder blocks:

P0 = p3.p2.p1.p0

P1 = p7.p6.p5.p4

P2 = p11.p10.p9.p8

P3 = p15.p14.p13.p12

Generate signals for each of the four 4-bit adder blocks:G0 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0

G1 = g7 + p7.g6 + p7.p6.g5 + p7.p6.p5.g4

G2 = g11 + p11.g10 + p11.p10.g9 + p11.p10.p9.g8

G3 = g15 + p15.g14 + p15.p14.g13 + p15.p14.p13.g12

Page 23: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder

CarryIn signals for each of the four 4-bit adder blocks (see earlier carry-in equations in terms of generate/propagates):

C1 = G0 + P0.c0

C2 = G1 + P1.G0 + P1.P0.c0

C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.c0

C4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.c0

Page 24: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

C a rry In

R e s u lt0 - -3C a rry In

R e s u lt4 - -7C a rry In

R e s u lt8 - -1 1C a rry In

C a rry O u t

R e s u lt1 2 - - 1 5C a rry In

C 1

C 2

C 3

C 4

P 0G 0

P 1G 1

P 2G 2

P 3G 3

a 0b 0a 1b 1a 2b 2a 3b 3

a 4b 4a 5b 5a 6b 6a 7b 7

a 8b 8a 9b 9

a 1 0b 1 0a 1 1b 1 1

a 1 2b 1 2a 1 3b 1 3a 1 4b 1 4a 1 5b 1 5

Lo

gic

to

co

mp

ute

C1,

C2,

C3,

C4

4bAdder0

4bAdder1

4bAdder2

4bAdder3

16-bit carry-lookahead adder from four 4-bitadders and one carry-lookahead unit

Carry-lookaheadLogic

ALU0

ALU1

ALU2

ALU3

a1

a0

a2

a3

b1

b0

b2

b3

s0

s1

s2

s3

Lo

gic

to

co

mp

ute

c

1,

c2

, c

3,

c4

, P

0,

G0

Blow-up of 4-bit adder:(conceptually) consisting of four 1-bit ALUs plus logic to compute all CarryOut bits and one super generate and one super propagate bit.Each 1-bit ALU is exactly as for ripple-carry except c1, c2, c3 for ALUs 1, 2, 3 comes from the extra logic

CarryIn

Carry-lookahead Unit

Page 25: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Two-level carry-lookahead logic steps:1. compute pi’s and gi’s at each 1-bit ALU

2. compute Pi’s and Gi’s at each 4-bit adder unit

3. compute Ci’s in carry-lookahead unit

4. compute ci’s at each 4-bit adder unit

5. compute results (sum bits) at each 1-bit ALU

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder

Page 26: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Multiplication

Standard shift-add method:

Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000

m bits x n bits = m+n bit product Binary makes it easy:

multiplier bit 1 => copy multiplicand (1 x multiplicand) multiplier bit 0 => place 0 (0 x multiplicand)

3 versions of multiply hardware & algorithms:

x

Page 27: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 1

Done

1. TestMultiplier0

1a. Add multiplicand to product andplace the result in Product register

2. Shift the Multiplicand register left 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Algorithm

Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000

Page 28: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 1

64-bit ALU

Control test

MultiplierShift right

ProductWrite

MultiplicandShift left

64 bits

64 bits

32 bits

Done

1. TestMultiplier0

1a. Add multiplicand to product andplace the result in Product register

2. Shift the Multiplicand register left 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Multiplicand register, product register, ALU are64-bit wide; multiplier register is 32-bit wide

Algorithm

32-bit multiplicand starts at right half of multiplicand register

Product register is initialized at 0

Page 29: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version1

Done

1. TestMultiplier0

1a. Add multiplicand to product andplace the result in Product register

2. Shift the Multiplicand register left 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Itera Step Multiplier Multiplicand Product-tion 0 init 0011 0000 0010 0000 0000 values 1 1a 0011 0000 0010 0000 0010 2 0011 0000 0100 0000 0010 3 0001 0000 0100 0000 0010

2 …

Example: 0010 * 0011:

Algorithm

Page 30: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Observations on Multiply Version 1

1 step per clock cycle nearly 100 clock cycles to multiply two

32-bit numbers Half the bits in the multiplicand register always 0 64-bit adder is wasted 0’s inserted to right as multiplicand is shifted left

least significant bits of product never change once formed

Intuition: instead of shifting multiplicand to left, shift product to right…

Page 31: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 2

Done

1. TestMultiplier0

1a. Add multiplicand to the left half ofthe product and place the result inthe left half of the Product register

2. Shift the Product register right 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Algorithm

Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000

Page 32: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 2

MultiplierShift right

Write

32 bits

64 bits

32 bits

Shift right

Multiplicand

32-bit ALU

Product Control test

Done

1. TestMultiplier0

1a. Add multiplicand to the left half ofthe product and place the result inthe left half of the Product register

2. Shift the Product register right 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Multiplicand register, multiplier register, ALUare 32-bit wide; product register is 64-bit wide;multiplicand adds to left half of product register

Algorithm

Product register is initialized at 0

Page 33: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 2

Done

1. TestMultiplier0

1a. Add multiplicand to the left half ofthe product and place the result inthe left half of the Product register

2. Shift the Product register right 1 bit

3. Shift the Multiplier register right 1 bit

32nd repetition?

Start

Multiplier0 = 0Multiplier0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Itera Step Multiplier Multiplicand Product-tion 0 init 0011 0010 0000 0000 values 1 1a 0011 0010 0010 0000 2 0011 0010 0001 0000 3 0001 0010 0001 0000

2 …

Example: 0010 * 0011:

Algorithm

Page 34: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Observations on Multiply Version 2

Each step the product register wastes space that exactly matches the current size of the multiplier

Intuition: combine multiplier register and product register…

Page 35: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 3

ControltestWrite

32 bits

64 bits

Shift rightProduct

Multiplicand

32-bit ALU

Done

1. TestProduct0

1a. Add multiplicand to the left half ofthe product and place the result inthe left half of the Product register

2. Shift the Product register right 1 bit

32nd repetition?

Start

Product0 = 0Product0 = 1

No: < 32 repetitions

Yes: 32 repetitionsNo separate multiplier register; multiplier placed on right side of 64-bit product register

Algorithm

Product register is initialized with multiplier on right

Page 36: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Shift-add Multiplier Version 3

Done

1. TestProduct0

1a. Add multiplicand to the left half ofthe product and place the result inthe left half of the Product register

2. Shift the Product register right 1 bit

32nd repetition?

Start

Product0 = 0Product0 = 1

No: < 32 repetitions

Yes: 32 repetitions

Itera Step Multiplicand Product-tion 0 init 0010 0000 0011 values 1 1a 0010 0010 0011 2 0010 0001 0001

2 …

Example: 0010 * 0011:

Algorithm

Page 37: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Observations on Multiply Version 3

2 steps per bit because multiplier & product combined What about signed multiplication?

easiest solution is to make both positive and remember whether to negate product when done, i.e., leave out the sign bit, run for 31 steps, then negate if multiplier and multiplicand have opposite signs

Booth’s Algorithm is an elegant way to multiply signed numbers using same hardware – it also often quicker…

Page 38: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

MIPS Notes MIPS provides two 32-bit registers Hi and Lo to hold a

64-bit product mult, multu (unsigned) put the product of two 32-bit

register operands into Hi and Lo: overflow is ignored by MIPS but can be detected by programmer by examining contents of Hi

mflo, mfhi moves content of Hi or Lo to a general-purpose register

Pseudo-instructions mul (without overflow), mulo (with overflow), mulou (unsigned with overflow) take three 32-bit register operands, putting the product of two registers into the third

Page 39: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide 1001 Quotient

Divisor 1000 1001010 Dividend –1000 10 101 1010 –1000 10 Remainder

standart method: see how big a multiple of the divisor can be subtracted, creating quotient digit at each step

Binary makes it easy first, try 1 * divisor; if too big, 0 * divisor

Dividend = (Quotient * Divisor) + Remainder 3 versions of divide hardware & algorithm:

Page 40: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide Version 1

Done

Test Remainder

2a. Shift the Quotient register to the left,setting the new rightmost bit to 1

3. Shift the Divisor register right 1 bit

33rd repetition?

Start

Remainder < 0

No: < 33 repetitions

Yes: 33 repetitions

2b. Restore the original value by addingthe Divisor register to the Remainder

register and place the sum in theRemainder register. Also shift the

Quotient register to the left, setting thenew least significant bit to 0

1. Subtract the Divisor register from theRemainder register and place the result in the Remainder register

Remainder > 0–

Itera- Step Quotient Divisor Remaindertion 0 init 0000 0010 0000 0000 0111 1 1 0000 0010 0000 1110 0111 2b 0000 0010 0000 0000 0111 3 0000 0001 0000 0000 0111

2 … 3 4 5

Example: 0111 / 0010:

Algorithm

Page 41: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide Version 1

64-bit ALU

Controltest

QuotientShift left

RemainderWrite

DivisorShift right

64 bits

64 bits

32 bits

Done

Test Remainder

2a. Shift the Quotient register to the left,setting the new rightmost bit to 1

3. Shift the Divisor register right 1 bit

33rd repetition?

Start

Remainder < 0

No: < 33 repetitions

Yes: 33 repetitions

2b. Restore the original value by addingthe Divisor register to the Remainder

register and place the sum in theRemainder register. Also shift the

Quotient register to the left, setting thenew least significant bit to 0

1. Subtract the Divisor register from theRemainder register and place the result in the Remainder register

Remainder > 0–

Divisor register, remainder register, ALU are64-bit wide; quotient register is 32-bit wide

Algorithm

32-bit divisor starts at left half of divisor register

Remainder register is initialized with the dividend at right

Why 33?

Quotient register isinitialized to be 0

Page 42: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide Version 1

Page 43: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Observations on Divide Version 1

Half the bits in divisor always 0 1/2 of 64-bit adder is wasted 1/2 of divisor register is wasted

Intuition: instead of shifting divisor to right, shift remainder to left…

Step 1 cannot produce a 1 in quotient bit – as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit)

Intuition: switch order to shift first and then subtract – can save 1 iteration…

Page 44: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide Version 2

Controltest

QuotientShift left

Write

32 bits

64 bits

32 bits

Shift left

Divisor

32-bit ALU

Remainder

Divisor register, quotient register, ALU are 32-bit wide; remainder register is 64-bit wide

Remainder register is initialized with the dividend at right

Done. Shift left half of Remainder right 1 bit

Test Remainder

3a. Shift the Remainder register to the left, setting the new rightmost bit to 0.

32nd repetition?

Start

Remainder < 0

No: < 32 repetitions

Yes: 32 repetitions

3b. Restore the original value by addingthe Divisor register to the left half of theRemainder register and place the sum

in the left half of the Remainder register.Also shift the Remainder register to theleft, setting the new rightmost bit to 0

2. Subtract the Divisor register from theleft half of the Remainder register andplace the result in the left half of the

Remainder register

Remainder 0

1. Shift the Remainder register left 1 bit

–>

Also shift the Quotient register to the leftsetting to the new rightmost bit to 1.

Also shift the Quotient register to the leftsetting to the new rightmost bit to 0.

AlgorithmWhy this correction step?

Page 45: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Observations on Divide Version 2

Each step the remainder register wastes space that exactly matches the current size of the quotient

Intuition: combine quotient register and remainder register…

Page 46: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide Version 3

Done. Shift left half of Remainder right 1 bit

Test Remainder

3a. Shift the Remainder register to the left, setting the new rightmost bit to 1

32nd repetition?

Start

Remainder < 0

No: < 32 repetitions

Yes: 32 repetitions

3b. Restore the original value by addingthe Divisor register to the left half of theRemainder register and place the sum

in the left half of the Remainder register.Also shift the Remainder register to theleft, setting the new rightmost bit to 0

2. Subtract the Divisor register from theleft half of the Remainder register andplace the result in the left half of the

Remainder register

Remainder 0

1. Shift the Remainder register left 1 bit

–>

Write

32 bits

64 bits

Shift leftShift right

Remainder

32-bit ALU

Divisor

Controltest

No separate quotient register; quotient is entered on the right side of the 64-bit remainder register

Algorithm

Remainder register is initialized with the dividend at right

Why this correction step?

Page 47: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide Version 3

Done. Shift left half of Remainder right 1 bit

Test Remainder

3a. Shift the Remainder register to the left, setting the new rightmost bit to 1

32nd repetition?

Start

Remainder < 0

No: < 32 repetitions

Yes: 32 repetitions

3b. Restore the original value by addingthe Divisor register to the left half of theRemainder register and place the sum

in the left half of the Remainder register.Also shift the Remainder register to theleft, setting the new rightmost bit to 0

2. Subtract the Divisor register from theleft half of the Remainder register andplace the result in the left half of the

Remainder register

Remainder 0

1. Shift the Remainder register left 1 bit

–>

Example: 0111 / 0010:

Itera- Step Divisor Remaindertion 0 init 0010 0000 0111 1 0010 0000 1110 1 2 0010 1110 1110 3b 0010 0001 1100

2 … 3 4

Algorithm

Page 48: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Observations on Divide Version 3

Same hardware as Multiply Version 3

Signed divide: make both divisor and dividend positive and perform

division negate the quotient if divisor and dividend were of

opposite signs make the sign of the remainder match that of the dividend this ensures always

dividend = (quotient * divisor) + remainder –quotient (x/y) = quotient (–x/y) (e.g. 7 = 3*2 + 1 & –7 = –3*2

– 1)

Page 49: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Divide generates the reminder in hi and the quotient in lo

div $s0, $s1 # lo = $s0 / $s1

# hi = $s0 mod $s1

Instructions mfhi rd and mflo rd are provided to move the quotient and reminder to (user accessible) registers in the register file

MIPS Divide Instruction

As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0.

op rs rt rd shamt funct

Page 50: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

MIPS Notes div rs, rt (signed), divu rs, rt (unsigned), with

two 32-bit register operands, divide the contents of the operands and put remainder in Hi register and quotient in Lo; overflow is ignored in both cases

pseudo-instructions div rdest, rsrc1, src2 (signed with overflow), divu rdest, rsrc1, src2 (unsigned without overflow) with three 32-bit register operands puts quotients of two registers into third

Page 51: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Floating Point

Page 52: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Review of Numbers• Computers are made to deal with numbers

• What can we represent in N bits?• Unsigned integers:

0 to 2N - 1• Signed Integers (Two’s Complement)

-2(N-1) to 2(N-1) - 1

Page 53: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Other Numbers• What about other numbers?

• Very large numbers? (seconds/century)3,155,760,00010 (3.1557610 x 109)

• Very small numbers? (atomic diameter)0.0000000110 (1.010 x 10-8)

• Rationals (repeating pattern) 2/3 (0.666666666. . .)

• Irrationals21/2 (1.414213562373. . .)

• Transcendentalse (2.718...), (3.141...)

• All represented in scientific notation

Page 54: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Scientific Notation (in Decimal)

6.0210 x 1023

radix (base)decimal point

mantissa exponent

• Normalized form: no leadings 0s (exactly one digit to left of decimal point)

• Alternatives to representing 1/1,000,000,000• Normalized: 1.0 x 10-9

• Not normalized: 0.1 x 10-8,10.0 x 10-10

Page 55: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Scientific Notation (in Binary)

1.0two x 2-1

radix (base)“binary point”

exponent

• Computer arithmetic that supports it called floating point, because it represents numbers where the binary point is not fixed, as it is for integers

• Declare such variable in C as float

mantissa

Page 56: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Floating Point Representation (1/2)• Normal format: +1.xxxxxxxxxxtwo*2yyyytwo

• Multiple of Word Size (32 bits)

031S Exponent30 23 22

Significand

1 bit 8 bits 23 bits

• S represents SignExponent represents y’sSignificand represents x’s

• Represent numbers as small as 2.0 x 10-38 to as large as 2.0 x 1038

Page 57: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Floating Point Representation (2/2)• What if result too large? (> 2.0x1038 )

• Overflow!• Overflow Exponent larger than represented in 8-bit Exponent field

• What if result too small? (>0, < 2.0x10-38 )• Underflow!• Underflow Negative exponent larger than represented in 8-bit Exponent field

• How to reduce chances of overflow or underflow?

Page 58: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Double Precision Fl. Pt. Representation• Next Multiple of Word Size (64 bits)

• Double Precision (vs. Single Precision)• C variable declared as double• Represent numbers almost as small as 2.0 x 10-308 to almost as large as 2.0 x 10308

• But primary advantage is greater accuracy due to larger significand

031S Exponent

30 20 19Significand

1 bit 11 bits 20 bitsSignificand (cont’d)

32 bits

Page 59: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

IEEE 754 Floating Point Standard (1/4)• Single Precision, DP similar

• Sign bit: 1 means negative0 means positive

• Significand:• To pack more bits, leading 1 implicit for normalized numbers

• 1 + 23 bits single, 1 + 52 bits double• always true: Significand < 1 (for normalized numbers)

• Note: 0 has no leading 1, so reserve exponent value 0 just for number 0

Page 60: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

IEEE 754 Floating Point Standard (2/4)• Want FP numbers to be used even if no

FP hardware; e.g., sort records with FP numbers using integer compares

• Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands

• Then want order:• Highest order bit is sign ( negative < positive)• Exponent next, so big exponent => bigger #• Significand last: exponents same => bigger #

Page 61: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

IEEE 754 Floating Point Standard (3/4)• Negative Exponent?

• 2’s comp? 1.0 x 2-1 v. 1.0 x2+1 (1/2 v. 2)

0 1111 1111 000 0000 0000 0000 0000 00001/20 0000 0001 000 0000 0000 0000 0000 00002• This notation using integer compare of 1/2 v. 2 makes 1/2 > 2!

• Instead, pick notation 0000 0001 is most negative, and 1111 1111 is most positive• 1.0 x 2-1 v. 1.0 x2+1 (1/2 v. 2)

1/2 0 0111 1110 000 0000 0000 0000 0000 0000

0 1000 0000 000 0000 0000 0000 0000 00002

Page 62: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

IEEE 754 Floating Point Standard (4/4)• Called Biased Notation, where bias is number subtracted to get real number• IEEE 754 uses bias of 127 for single prec.• Subtract 127 from Exponent field to get actual value for exponent

• 1023 is bias for double precision

• Summary (single precision):031

S Exponent30 23 22

Significand

1 bit 8 bits 23 bits• (-1)S x (1 + Significand) x 2(Exponent-127)

• Double precision identical, except with exponent bias of 1023

Page 63: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

• Floating Point numbers approximate values that we want to use.

• IEEE 754 Floating Point Standard is most widely accepted attempt to standardize interpretation of such numbers• Every desktop or server computer sold since ~1997 follows these conventions

• Summary (single precision):031

S Exponent30 23 22

Significand

1 bit 8 bits 23 bits• (-1)S x (1 + Significand) x 2(Exponent-127)

• Double precision identical, bias of 1023

Page 64: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Example: Converting Binary FP to Decimal

• Sign: 0 => positive

• Exponent: • 0110 1000two = 104ten

• Bias adjustment: 104 - 127 = -23

• Significand:• 1 + 1x2-1+ 0x2-2 + 1x2-3 + 0x2-4 + 1x2-5 +...=1+2-1+2-3 +2-5 +2-7 +2-9 +2-14 +2-15 +2-17 +2-22

= 1.0ten + 0.666115ten

0 0110 1000 101 0101 0100 0011 0100 0010

• Represents: 1.666115ten*2-23 ~ 1.986*10-7

(about 2/10,000,000)

Page 65: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Converting Decimal to FP (1/3)• Simple Case: If denominator is an exponent of 2 (2, 4, 8, 16, etc.), then it’s easy.

• Show MIPS representation of -0.75• -0.75 = -3/4• -11two/100two = -0.11two

• Normalized to -1.1two x 2-1

• (-1)S x (1 + Significand) x 2(Exponent-127)

• (-1)1 x (1 + .100 0000 ... 0000) x 2(126-127)

1 0111 1110 100 0000 0000 0000 0000 0000

Page 66: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Converting Decimal to FP (2/3)• Not So Simple Case: If denominator is not an exponent of 2.

• Then we can’t represent number precisely, but that’s why we have so many bits in significand: for precision

• Once we have significand, normalizing a number to get the exponent is easy.

• So how do we get the significand of a neverending number?

Page 67: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Converting Decimal to FP (3/3)• Fact: All rational numbers have a repeating pattern when written out in decimal.

• Fact: This still applies in binary.

• To finish conversion:• Write out binary number with repeating pattern.

• Cut it off after correct number of bits (different for single v. double precision).

• Derive Sign, Exponent and Significand fields.

Page 68: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Example: Representing 1/3 in MIPS• 1/3

= 0.33333…10

= 0.25 + 0.0625 + 0.015625 + 0.00390625 + …

= 1/4 + 1/16 + 1/64 + 1/256 + …

= 2-2 + 2-4 + 2-6 + 2-8 + …

= 0.0101010101… 2 * 20

= 1.0101010101… 2 * 2-2

• Sign: 0• Exponent = -2 + 127 = 125 = 01111101• Significand = 0101010101…

0 0111 1101 0101 0101 0101 0101 0101 010

Page 69: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Representation for ± ∞• In FP, divide by 0 should produce ± ∞, not overflow.

• Why?• OK to do further computations with ∞ E.g., X/0 > Y may be a valid comparison

• IEEE 754 represents ± ∞• Most positive exponent reserved for ∞• Significands all zeroes

Page 70: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Representation for 0• Represent 0?

• exponent all zeroes• significand all zeroes too• What about sign?• +0: 0 00000000 00000000000000000000000• -0: 1 00000000 00000000000000000000000

• Why two zeroes?• Helps in some limit comparisons

Page 71: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Special Numbers

• What have we defined so far? (Single Precision)

Exponent Significand Object

0 0 0

0 nonzero ???

1-254 anything +/- fl. pt. #

255 0 +/- ∞

255 nonzero ???

Page 72: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Representation for Not a Number• What is sqrt(-4.0)or 0/0?

• If ∞ not an error, these shouldn’t be either.• Called Not a Number (NaN)• Exponent = 255, Significand nonzero

• Why is this useful?• Hope NaNs help with debugging?• They contaminate: op(NaN, X) = NaN

Page 73: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Floating Point Addition Algorithm:

Done

2. Add the significands

4. Round the significand to the appropriatenumber of bits

Still normalized?

Start

Yes

No

No

YesOverflow orunderflow?

Exception

3. Normalize the sum, either shifting right andincrementing the exponent or shifting left

and decrementing the exponent

1. Compare the exponents of the two numbers.Shift the smaller number to the right until itsexponent would match the larger exponent

Page 74: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Floating Point Addition Hardware:

0 10 1 0 1

Control

Small ALU

Big ALU

Sign Exponent Significand Sign Exponent Significand

Exponentdifference

Shift right

Shift left or right

Rounding hardware

Sign Exponent Significand

Increment ordecrement

0 10 1

Shift smallernumber right

Compareexponents

Add

Normalize

Round

Page 75: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Floating Point Multpication Algorithm:

2. Multiply the significands

4. Round the significand to the appropriatenumber of bits

Still normalized?

Start

Yes

No

No

YesOverflow orunderflow?

Exception

3. Normalize the product if necessary, shiftingit right and incrementing the exponent

1. Add the biased exponents of the twonumbers, subtracting the bias from the sum

to get the new biased exponent

Done

5. Set the sign of the product to positive if thesigns of the original operands are the same;

if they differ make the sign negative

Page 76: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

MIPS Floating Point Architecture (1/4)

• Separate floating point instructions:• Single Precision:

add.s, sub.s, mul.s, div.s

• Double Precision:add.d, sub.d, mul.d,

div.d

• These are far more complicated than their integer counterparts

• Can take much longer to execute

Page 77: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

MIPS Floating Point Architecture (2/4)

• Problems:• Inefficient to have different instructions take vastly differing amounts of time.

• Generally, a particular piece of data will not change FP int within a program.

- Only 1 type of instruction will be used on it.

• Some programs do no FP calculations• It takes lots of hardware relative to integers to do FP fast

Page 78: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

MIPS Floating Point Architecture (3/4)

• 1990 Solution: Make a completely separate chip that handles only FP.

• Coprocessor 1: FP chip• contains 32 32-bit registers: $f0, $f1, …• most of the registers specified in .s and .d instruction refer to this set

• separate load and store: lwc1 and swc1(“load word coprocessor 1”, “store …”)

• Double Precision: by convention, even/odd pair contain one DP FP number: $f0/$f1, $f2/$f3, … , $f30/$f31

- Even register is the name

Page 79: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

MIPS Floating Point Architecture (4/4)

• Today, FP coprocessor integrated with CPU, or cheap chips may leave out FP HW

• Instructions to move data between main processor and coprocessors:

• mfc0, mtc0, mfc1, mtc1, etc.

• Appendix contains many more FP ops

Page 80: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Performance evaluation

Page 81: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Performance Metrics Purchasing perspective

given a collection of machines, which has the - best performance ?- least cost ?- best cost/performance?

Design perspective faced with design options, which has the

- best performance improvement ?- least cost ?- best cost/performance?

Both require basis for comparison metric for evaluation

Page 82: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Two Notions of “Performance”

Plane

Boeing 747

BAD/Sud Concorde

TopSpeedDC to Paris Passengers Throughput (pass

x mph)

610 mph6.5 hours 470 286,700

1350 mph3 hours 132 178,200

• Which has higher performance?• Time to deliver 1 passenger?• Time to deliver 400 passengers?

• In a computer, time for 1 job calledResponse Time or Execution Time

• In a computer, jobs per day calledThroughput or Bandwidth

• We will focus primarily on execution time for a single job

Page 83: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Performance Metrics

Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task

Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Additional Metrics Smallest/lightest Longest battery life Most reliable/durable (in space)

Page 84: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

What is Time?

Straightforward definition of time: Total time to complete a task, including disk accesses, memory

accesses, I/O activities, operating system overhead, ... “real time”, “response time” or “elapsed time”

Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time)

“CPU execution time” or “CPU time” Often divided into system CPU time (in OS) and user CPU time (in

user program)

Page 85: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Performance Factors Want to distinguish elapsed time and the time spent on

our task

CPU execution time (CPU time) – time the CPU spends working on a task

Does not include time waiting for I/O or running other programs

CPU execution time # CPU clock cycles for a program for a program = x clock cycle

time

CPU execution time # CPU clock cycles for a program for a program clock rate = -------------------------------------------

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Many techniques that decrease the number of clock cycles also increase the clock cycle time

or

Page 86: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Review: Machine Clock Rate

Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)

CC = 1 / CR

one clock period

10 nsec clock cycle => 100 MHz clock rate

5 nsec clock cycle => 200 MHz clock rate

2 nsec clock cycle => 500 MHz clock rate

1 nsec clock cycle => 1 GHz clock rate

500 psec clock cycle => 2 GHz clock rate

250 psec clock cycle => 4 GHz clock rate

200 psec clock cycle => 5 GHz clock rate

Page 87: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Clock Cycles per Instruction Not all instructions take the same amount of time to

execute One way to think about execution time is that it equals the

number of instructions executed multiplied by the average time per instruction

Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute

A way to compare two different implementations of the same ISA

# CPU clock cycles # Instructions Average clock cycles for a program for a program per instruction = x

CPI for this instruction class

A B C

CPI 1 2 3

Page 88: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Effective CPI

Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

Overall effective CPI = (CPIi x ICi)i = 1

n

Where ICi is the count (percentage) of the number of instructions of class i executed

CPIi is the (average) number of clock cycles per instruction for that instruction class

n is the number of instruction classes

The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

Page 89: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

THE Performance Equation Our basic performance equation is then

CPU time = Instruction_count x CPI x clock_cycle

Instruction_count x CPI

clock_rate CPU time = -----------------------------------------------

or

These equations separate the three key factors that affect performance

Can measure the CPU execution time by running the program

The clock rate is usually given Can measure overall instruction count by using profilers/

simulators without knowing all of the implementation details

CPI varies by instruction type and ISA implementation for which we must know the implementation detailsinstruction count != the number of lines in the code

Page 90: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Determinates of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

Instruction_count

CPI clock_cycle

Algorithm

Programming language

Compiler

ISA

Processor organization

TechnologyX

XX

XX

X X

X

X

X

X

X

Page 91: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

A Simple Example

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

How does this compare with using branch prediction to shave a cycle off the branch time?

Op Freq CPIi Freq x CPIi

ALU 50% 1

Load 20% 5

Store 10% 3

Branch 20% 2

=

.5

1.0

.3

.4

2.2

CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster

1.6

.5

.4

.3

.4

.5

1.0

.3

.2

2.0

CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

Page 92: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Comparing and Summarizing Performance

Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

How do we summarize the performance for benchmark set with a single number?

The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

AM = 1/n Timeii = 1

n

Where Timei is the execution time for the ith program of a total of n programs in the workload

A smaller mean indicates a smaller average execution time and thus improved performance

Page 93: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

SPEC Benchmarks www.spec.org

Integer benchmarks FP benchmarks

gzip compression wupwise Quantum chromodynamics

vpr FPGA place & route swim Shallow water model

gcc GNU C compiler mgrid Multigrid solver in 3D fields

mcf Combinatorial optimization applu Parabolic/elliptic pde

crafty Chess program mesa 3D graphics library

parser Word processing program galgel Computational fluid dynamics

eon Computer visualization art Image recognition (NN)

perlbmk perl application equake Seismic wave propagation simulation

gap Group theory interpreter facerec Facial image recognition

vortex Object oriented database ammp Computational chemistry

bzip2 compression lucas Primality testing

twolf Circuit place & route fma3d Crash simulation fem

sixtrack Nuclear physics accel

apsi Pollutant distribution

Page 94: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Example SPEC Ratings

Page 95: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Other Performance Metrics Power consumption – especially in the embedded market

where battery life is important (and passive cooling) For power-limited applications, the most important metric is

energy efficiency

Page 96: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Summary: Evaluating ISAs Design-time metrics:

Can it be implemented, in how long, at what cost? Can it be programmed? Ease of compilation?

Static Metrics: How many bytes does the program occupy in memory?

Dynamic Metrics: How many instructions are executed? How many bytes does the

processor fetch to execute the program? How many clocks are required per instruction?

Best Metric: Time to execute the program! CPI

Inst. Count Cycle Timedepends on the instructions set, the processor organization, and compilation techniques.

Page 97: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Example A certain computer has a CPU running at 500 MHz. Your PINE session

on this machine ends up executing 200 million instructions, of the following mix:

Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

What is the average CPI for this session?

Page 98: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session

on this machine ends up executing 200 million instructions, of the following mix:

Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

What is the average CPI for this session? Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3.

Page 99: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session

on this machine ends up executing 200 million instructions, of the following mix:

Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

How much CPU time was used in this session?

Page 100: Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session

on this machine ends up executing 200 million instructions, of the following mix:

Type CPI Freq

A 4 20 %

B 3 30 %

C 1 40 %

D 2 10 %

How much CPU time was used in this session? Answer: (200 x 106 instructions) · (2 ns/cycle) · (2.3 cycles/instruction) =

0.92 seconds.