peter-quinby
Computer Architecture and Design – ECEN 350
Part 7
[Some slides adapted from M. Irwin, D. Patterson and others]
ALU Implementation
Review: Basic Hardware
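[Figure: gate symbols and truth tables]

AND gate, c = a . b
  a b | c
  0 0 | 0
  0 1 | 0
  1 0 | 0
  1 1 | 1

OR gate, c = a + b
  a b | c
  0 0 | 0
  0 1 | 1
  1 0 | 1
  1 1 | 1

Inverter, c = a' (a' denotes NOT a)
  a | c
  0 | 1
  1 | 0

Multiplexor with inputs a and b, select d, output c
  d | c
  0 | a
  1 | b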
1. AND gate (c = a . b)
2. OR gate (c = a + b)
3. Inverter (c = a', the complement of a)
4. Multiplexor (if d == 0, c = a; else c = b)
To warm up let's build a logic unit to support the and and or instructions for MIPS (32-bit registers)
We'll just build a 1-bit unit and use 32 of them
Possible implementation using a multiplexor :
A Simple Multi-Function Logic Unit
[Figure: multiplexor with inputs a and b, an operation selector, and one output]
Selects one of the inputs to be the output based on a control input
Implementation with a Multiplexor
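[Figure: 1-bit logic unit. a AND b feeds multiplexor input 0, a OR b feeds input 1, and Operation selects Result; 32 such units are used side by side]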
• Not easy to decide the best way to implement:
  - do not want too many inputs to a single gate
  - do not want to have to go through too many gates (= levels)
• Let's look at a 1-bit ALU for addition:
Implementations
cout = a.b + a.cin + b.cin
sum  = a'.b'.cin + a'.b.cin' + a.b'.cin' + a.b.cin
     = a xor b xor cin        (x' denotes NOT x; xor = exclusive or)
[Figure: 1-bit adder block with inputs a, b, CarryIn and outputs Sum, CarryOut]
Building a 1-bit Binary Adder
1 bit Full Adder
[Figure: 1-bit full adder with inputs A, B, carry_in and outputs S, carry_out]
S = A xor B xor carry_in
carry_out = AB v Acarry_in v Bcarry_in (majority function)
How can we use it to build a 32-bit adder?
How can we modify it easily to build an adder/subtractor?
A B carry_in carry_out S
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1
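The two full-adder equations can be checked against the truth table with a few lines of Python. This is only an illustrative sketch; the function names `full_adder` and `full_adder_table` are made up for this example and are not part of the slides.

```python
def full_adder(a, b, carry_in):
    """Return (S, carry_out) for three 1-bit inputs."""
    s = a ^ b ^ carry_in                                    # S = A xor B xor carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)   # majority function
    return s, carry_out


def full_adder_table():
    """Rows in the same column order as the table: A, B, carry_in, carry_out, S."""
    rows = []
    for a in (0, 1):
        for b in (0, 1):
            for carry_in in (0, 1):
                s, carry_out = full_adder(a, b, carry_in)
                rows.append((a, b, carry_in, carry_out, s))
    return rows


if __name__ == "__main__":
    for row in full_adder_table():
        print(*row)
```

Each printed row should agree with the corresponding row of the truth table above.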
1-bit Adder Logic
[Figures: three 1-bit adder circuits]
Half-adder with one xor gate
Full-adder from 2 half-adders and an or gate
Half-adder with the xor gate replaced by primitive gates using the equation A xor B = A'.B + A.B' (X' denotes NOT X)
Building a 32-bit ALU
[Figure: 1-bit ALU in which a multiplexor controlled by Operation selects a AND b (input 0), a OR b (input 1), or the adder sum (input 2), with CarryIn and CarryOut]
[Figure: 32-bit ALU built from ALU0 ... ALU31; each CarryOut feeds the next CarryIn and Operation goes to every 1-bit ALU]
Ripple-Carry Logic for 32-bit ALU
1-bit ALU for AND, OR and add
Multiplexor control line
Two's complement approach: just negate b and add. How do we negate?
Recall the negation shortcut: invert each bit of b and set the CarryIn of the least significant bit (ALU0) to 1
What about Subtraction (a – b) ?
[Figure: 1-bit ALU with a Binvert-controlled multiplexor that selects b (input 0) or its complement (input 1) ahead of the AND, OR and adder inputs]
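As a rough software model of the ripple-carry design with Binvert, the sketch below chains 1-bit adder cells and performs a – b by inverting b and forcing the CarryIn of bit 0 to 1. The names `alu_bit` and `ripple_add_sub` and the `width` parameter are assumptions made for illustration only.

```python
def alu_bit(a, b, carry_in, binvert):
    """One adder cell: optionally invert b, then add the three bits."""
    b ^= binvert
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def ripple_add_sub(a, b, subtract=False, width=32):
    """Return (result, final CarryOut) of a + b or a - b, modulo 2**width."""
    binvert = 1 if subtract else 0
    carry = binvert            # CarryIn of ALU0 is 1 only when subtracting
    result = 0
    for i in range(width):     # the carry ripples from bit 0 up to bit width-1
        s, carry = alu_bit((a >> i) & 1, (b >> i) & 1, carry, binvert)
        result |= s << i
    return result, carry
```

For instance, `ripple_add_sub(7, 3, subtract=True)` goes through the same invert-and-add-one path that the hardware uses for subtraction.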
Overflow Detection and Effects
• Overflow: the result is too large to represent in the number of bits allocated
• When adding operands with different signs, overflow cannot occur! Overflow occurs when
  - adding two positives yields a negative
  - or, adding two negatives gives a positive
  - or, subtracting a negative from a positive gives a negative
  - or, subtracting a positive from a negative gives a positive
• On overflow, an exception (interrupt) occurs
  - Control jumps to a predefined address for the exception
  - Interrupted address (address of instruction causing the overflow) is saved for possible resumption
• Don't always want to detect (interrupt on) overflow
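Overflow Detection
• Overflow can be detected by: Carry into MSB xor Carry out of MSB
• Logic for overflow detection can therefore be put into ALU31

Examples (4-bit two's complement):
    0 1 1 1      7
  + 0 0 1 1      3
    1 0 1 0    – 6     carry into MSB = 1, carry out of MSB = 0: overflow

    1 1 0 0    – 4
  + 1 0 1 1    – 5
    0 1 1 1      7     carry into MSB = 0, carry out of MSB = 1: overflow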
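The overflow rule (carry into the MSB xor carry out of the MSB) can be tried out on the 4-bit examples with a small sketch. The function name `add_with_overflow` and the default `width=4` are assumptions chosen for this illustration.

```python
def add_with_overflow(a, b, width=4):
    """Add two width-bit patterns; return (result bits, overflow flag)."""
    carry = 0
    carry_into_msb = 0
    result = 0
    for i in range(width):
        if i == width - 1:
            carry_into_msb = carry          # carry entering the sign position
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    overflow = carry_into_msb ^ carry       # carry is now the carry out of the MSB
    return result, overflow
```

Adding 0111 and 0011 this way reports overflow, as does adding 1100 and 1011, while 1111 + 0001 (that is, –1 + 1) does not.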
New MIPS Instructions
Category                     Instr                          Op Code   Example               Meaning
Arithmetic (R & I format)    add unsigned                   0 and 21  addu  $s1, $s2, $s3   $s1 = $s2 + $s3
                             sub unsigned                   0 and 23  subu  $s1, $s2, $s3   $s1 = $s2 - $s3
                             add imm. unsigned              9         addiu $s1, $s2, 6     $s1 = $s2 + 6
Data Transfer                ld byte unsigned               24        lbu   $s1, 25($s2)    $s1 = Mem($s2+25)
                             ld half unsigned               25        lhu   $s1, 25($s2)    $s1 = Mem($s2+25)
Cond. Branch (I & R format)  set on less than unsigned      0 and 2b  sltu  $s1, $s2, $s3   if ($s2<$s3) $s1=1 else $s1=0
                             set on less than imm unsigned  b         sltiu $s1, $s2, 6     if ($s2<6) $s1=1 else $s1=0

Sign extend: addi, addiu, slti, sltiu
Zero extend: andi, ori, xori
Overflow detected: add, addi, sub
Tailoring the ALU to MIPS: Test for Less-than and Equality
• Need to support the set-on-less-than instruction
  - e.g., slt $t0, $t3, $t4
  - remember: slt is an R-type instruction that produces 1 if rs < rt and 0 otherwise
  - idea is to use subtraction: rs < rt <=> rs – rt < 0. Recall the msb of a negative number is 1
  - two cases after subtraction rs – rt:
      if no overflow, then rs < rt <=> most significant bit of rs – rt = 1
      if overflow, then rs < rt <=> most significant bit of rs – rt = 0
  - why? e.g.,  5ten – 6ten = 0101 – 0110 = 0101 + 1010 = 1111 (ok!)
               -7ten – 6ten = 1001 – 0110 = 1001 + 1010 = 0011 (overflow!)
  - the set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other Less bits are hardwired to 0; so Less is the 32-bit result of slt
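The two cases combine into a single rule: the set bit equals the most significant bit of rs – rt xor the overflow bit. The sketch below models this on narrow words so it can be checked exhaustively; `slt_bit` and its `width` parameter are hypothetical names used only for this example.

```python
def slt_bit(rs, rt, width=4):
    """Return 1 if rs < rt as signed width-bit values, using only rs - rt."""
    mask = (1 << width) - 1
    b = ~rt & mask              # Binvert: complement rt
    carry = 1                   # CarryIn of bit 0 is 1 for subtraction
    carry_into_msb = 0
    result = 0
    for i in range(width):
        if i == width - 1:
            carry_into_msb = carry
        ai, bi = (rs >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    overflow = carry_into_msb ^ carry
    msb = (result >> (width - 1)) & 1
    return msb ^ overflow       # no overflow: msb; overflow: inverted msb
```

With the slide's examples, `slt_bit(0b0101, 0b0110)` takes the no-overflow path and `slt_bit(0b1001, 0b0110)` takes the overflow path; both compare a smaller value against a larger one.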
Supporting slt
[Figure a: 1-bit ALU with Binvert, CarryIn and CarryOut, plus a Less input wired to multiplexor input 3]
[Figure b: 1-bit ALU for the most significant bit, with overflow detection and the extra outputs Set and Overflow]
[Figure: 32-bit ALU in which the Set output of ALU31 is routed to the Less input of ALU0 and the Less inputs of ALU1 ... ALU31 are 0]
1-bit ALU for the 31 least significant bits
1-bit ALU for the most significant bit
Extra set bit, to be routed to the Less input of the least significant 1-bit ALU, is computed from the most significant Result bit and the Overflow bit (it is not the output of the adder as the figure seems to indicate)
Less input of the 31 most significant ALUs is always 0
32-bit ALU from 31 copies of ALU at top left and 1 copy of ALU at bottom left in the most significant position
Tailoring the ALU to MIPS: Test for Less-than and Equality
• Need to support test for equality
  - e.g., beq $t5, $t6, Label
  - use subtraction: rs - rt = 0 <=> rs = rt
  - do we need to consider overflow?
Supporting Test for Equality
[Figure: 32-bit ALU as before, with Bnegate driving both Binvert and the CarryIn of ALU0, and a Zero output computed from Result0 ... Result31]
[Figure: symbol representing the ALU, with inputs a, b and ALU operation, and outputs Zero, Result, Overflow and CarryOut]

ALU control lines
Bnegate  Operation  Function
   0        00        and
   0        01        or
   0        10        add
   1        10        sub
   1        11        slt

Symbol representing ALU
Zero output is 1 only if all Result bits are 0
Combine CarryIn to the least significant ALU and Binvert into a single control line (Bnegate), as both are always 1 together or 0 together
32-bit MIPS ALU
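A behavioural sketch of the control-line table may help: it takes Bnegate and the 2-bit Operation and returns Result, Zero and Overflow. It works on whole 32-bit words rather than bit slices, and it detects overflow by comparing sign bits, which gives the same answer as carry-in xor carry-out of the MSB. The function name `alu` and the constant `MASK32` are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF


def alu(a, b, bnegate, operation):
    """Model of the ALU control lines; returns (result, zero, overflow)."""
    a &= MASK32
    b_in = (~b & MASK32) if bnegate else (b & MASK32)   # Binvert
    add_out = (a + b_in + bnegate) & MASK32             # CarryIn of ALU0 = Bnegate
    sa, sb, ss = a >> 31, b_in >> 31, add_out >> 31
    overflow = int(sa == sb and ss != sa)               # adder inputs agree in sign, sum differs
    if operation == 0b00:
        result = a & b_in                               # and
    elif operation == 0b01:
        result = a | b_in                               # or
    elif operation == 0b10:
        result = add_out                                # add (Bnegate=0) or sub (Bnegate=1)
    elif operation == 0b11:
        result = ss ^ overflow                          # slt: Set bit goes to Less of ALU0
    else:
        raise ValueError("operation must be a 2-bit value")
    zero = int(result == 0)
    return result, zero, overflow
```

For example, `alu(x, y, 1, 0b10)` performs the subtraction whose Zero output a beq would test.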
Conclusion
• We can build an ALU to support the MIPS instruction set
  - key idea: use a multiplexor to select the output we want
  - we can efficiently perform subtraction using two’s complement
  - we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
  - speed of a gate depends on the number of inputs (fan-in) to the gate
  - speed of a circuit depends on the number of gates in series (particularly, on the critical path to the deepest level of logic)
• Speed of MIPS operations
  - clever changes to organization can improve performance (similar to using better algorithms in software)
  - we’ll look at examples for addition, multiplication and division
Problem: Ripple-carry Adder is Slow
• Is a 32-bit ALU as fast as a 1-bit ALU? Why?
• Is there more than one way to do addition? Yes:
  - one extreme: ripple-carry, where the carry ripples through 32 ALUs: slow!
  - other extreme: sum-of-products for each CarryIn bit: super fast!
• CarryIn bits:
    c1 = b0.c0 + a0.c0 + a0.b0
    c2 = b1.c1 + a1.c1 + a1.b1
       = a1.a0.b0 + a1.a0.c0 + a1.b0.c0 + b1.a0.b0 + b1.a0.c0 + b1.b0.c0 + a1.b1   (substituting for c1)
    c3 = b2.c2 + a2.c2 + a2.b2
       = … = sum of 15 4-term products…
• How fast? But not feasible for a 32-bit ALU! Why? Exponential complexity!!
Note: ci is the CarryIn bit into the i-th ALU; c0 is the forced CarryIn into the least significant ALU
• An approach between our two extremes
• Motivation: if we didn't know the value of a carry-in, what could we do?
  - when would we always generate a carry? (generate) gi = ai . bi
  - when would we propagate the carry? (propagate) pi = ai + bi
• Express the carry-in equations in terms of generates/propagates:
    c1 = g0 + p0.c0
    c2 = g1 + p1.c1 = g1 + p1.g0 + p1.p0.c0
    c3 = g2 + p2.c2 = g2 + p2.g1 + p2.p1.g0 + p2.p1.p0.c0
    c4 = g3 + p3.c3 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0 + p3.p2.p1.p0.c0
• Feasible for 4-bit adders; with wider adders the complexity becomes unacceptable
  - solution: build a first level using 4-bit adders, then a second level on top
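To see that the expanded generate/propagate equations give the same carries as rippling, the following sketch computes c1..c4 both ways for a 4-bit slice. The function names `lookahead_carries` and `ripple_carries` are illustrative assumptions, not names from the slides.

```python
def lookahead_carries(a, b, c0):
    """c1..c4 from the expanded generate/propagate equations (4-bit inputs)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]   # gi = ai . bi
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]   # pi = ai + bi
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a, b, c0):
    """The same carries computed one after another, as in the ripple-carry adder."""
    carries = []
    c = c0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | (ai & c) | (bi & c)
        carries.append(c)
    return carries
```

In hardware each lookahead carry is a two-level sum of products computed in parallel; the software model only checks that the values agree.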
Two-level Carry-lookahead Adder: First Level
Two-level Carry-lookahead Adder: Second Level for a 16-bit adder
Propagate signals for each of the four 4-bit adder blocks:
P0 = p3.p2.p1.p0
P1 = p7.p6.p5.p4
P2 = p11.p10.p9.p8
P3 = p15.p14.p13.p12
Generate signals for each of the four 4-bit adder blocks:
G0 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0
G1 = g7 + p7.g6 + p7.p6.g5 + p7.p6.p5.g4
G2 = g11 + p11.g10 + p11.p10.g9 + p11.p10.p9.g8
G3 = g15 + p15.g14 + p15.p14.g13 + p15.p14.p13.g12
Two-level Carry-lookahead Adder: Second Level for a 16-bit adder
CarryIn signals for each of the four 4-bit adder blocks (see earlier carry-in equations in terms of generate/propagates):
C1 = G0 + P0.c0
C2 = G1 + P1.G0 + P1.P0.c0
C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.c0
C4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.c0
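A compact model of the 16-bit two-level scheme follows: block propagate/generate signals feed the second-level equations for C1..C4, and each block then forms its internal carries from its own block CarryIn. This is a hedged sketch; `block_pg` and `cla16` are hypothetical helper names.

```python
def block_pg(a4, b4):
    """Per-bit p, g and block-level P, G for one 4-bit block."""
    g = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    p = [((a4 >> i) & 1) | ((b4 >> i) & 1) for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return p, g, P, G


def cla16(a, b, c0=0):
    """Return (16-bit sum, CarryOut) using two-level carry lookahead."""
    blocks = [block_pg((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF) for k in range(4)]
    P = [blk[2] for blk in blocks]
    G = [blk[3] for blk in blocks]
    # Second level: CarryIn of each block (C[0] is c0, C[4] is the CarryOut).
    C = [c0,
         G[0] | (P[0] & c0),
         G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0),
         G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0),
         G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) | (P[3] & P[2] & P[1] & G[0])
              | (P[3] & P[2] & P[1] & P[0] & c0)]
    result = 0
    for k in range(4):
        p, g, _, _ = blocks[k]
        cin = C[k]
        # First level: carries into the four bits of block k.
        c = [cin,
             g[0] | (p[0] & cin),
             g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin),
             g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin)]
        for i in range(4):
            bit = 4 * k + i
            result |= (((a >> bit) & 1) ^ ((b >> bit) & 1) ^ c[i]) << bit
    return result, C[4]
```

The result can be compared against ordinary integer addition modulo 2^16 to check the equations.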
[Figure: four 4-bit adders (4bAdder0 ... 4bAdder3) producing Result0-3, Result4-7, Result8-11 and Result12-15 from a0..a15 and b0..b15; each sends its Pi and Gi to the logic that computes C1, C2, C3, C4, which supplies the block CarryIn signals and the final CarryOut]
16-bit carry-lookahead adder from four 4-bit adders and one carry-lookahead unit
[Figure: blow-up of a 4-bit adder with ALU0 ... ALU3, inputs a0..a3, b0..b3 and CarryIn, outputs s0..s3, and carry-lookahead logic to compute c1, c2, c3, c4, P0, G0]
Blow-up of 4-bit adder: (conceptually) consists of four 1-bit ALUs plus logic to compute all CarryOut bits, one super generate bit and one super propagate bit. Each 1-bit ALU is exactly as for ripple-carry, except that c1, c2, c3 for ALUs 1, 2, 3 come from the extra logic
Two-level carry-lookahead logic steps:
1. compute pi’s and gi’s at each 1-bit ALU
2. compute Pi’s and Gi’s at each 4-bit adder unit
3. compute Ci’s in carry-lookahead unit
4. compute ci’s at each 4-bit adder unit
5. compute results (sum bits) at each 1-bit ALU
Two-level Carry-lookahead Adder: Second Level for a 16-bit adder
Multiplication
Standard shift-add method:
    Multiplicand       1000
    Multiplier       x 1001
                       1000
                      0000
                     0000
                    1000
    Product        01001000
• m bits x n bits = m+n bit product
• Binary makes it easy:
  - multiplier bit 1 => copy multiplicand (1 x multiplicand)
  - multiplier bit 0 => place 0 (0 x multiplicand)
• 3 versions of multiply hardware & algorithms:
Shift-add Multiplier Version 1
Algorithm
Start
1. Test Multiplier0
   - Multiplier0 = 1: 1a. Add multiplicand to product and place the result in Product register
   - Multiplier0 = 0: go directly to step 2
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition?
   - No: < 32 repetitions: go back to step 1
   - Yes: 32 repetitions: Done
Shift-add Multiplier Version 1
[Figure: Version 1 hardware: 64-bit Multiplicand register (shift left), 64-bit ALU, 64-bit Product register (write), 32-bit Multiplier register (shift right), and Control test]
• Multiplicand register, product register, ALU are 64-bit wide; multiplier register is 32-bit wide
• 32-bit multiplicand starts at right half of multiplicand register
• Product register is initialized at 0
Shift-add Multiplier Version1
Example: 0010 * 0011:
Iteration  Step         Multiplier  Multiplicand  Product
0          init values  0011        0000 0010     0000 0000
1          1a           0011        0000 0010     0000 0010
           2            0011        0000 0100     0000 0010
           3            0001        0000 0100     0000 0010
2          …
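The Version 1 steps map directly onto a short loop. The sketch below is one possible software rendering, assuming unsigned operands; `multiply_v1` and its `width` parameter are names invented for this example (width=4 reproduces the small example, width=32 the real register sizes).

```python
def multiply_v1(multiplicand, multiplier, width=32):
    """Shift-add multiply with a double-width multiplicand, ALU and product."""
    wide_mask = (1 << (2 * width)) - 1
    mcand = multiplicand & ((1 << width) - 1)   # starts in the right half
    mplier = multiplier & ((1 << width) - 1)
    product = 0                                 # Product register initialized at 0
    for _ in range(width):
        if mplier & 1:                          # 1. test Multiplier0
            product = (product + mcand) & wide_mask   # 1a. add multiplicand to product
        mcand = (mcand << 1) & wide_mask        # 2. shift Multiplicand left 1 bit
        mplier >>= 1                            # 3. shift Multiplier right 1 bit
    return product
```

Calling `multiply_v1(0b0010, 0b0011, width=4)` walks through the register values shown in the example table for the first iteration.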
Observations on Multiply Version 1
• 1 step per clock cycle => nearly 100 clock cycles to multiply two 32-bit numbers
• Half the bits in the multiplicand register always 0
  - 64-bit adder is wasted
• 0’s inserted to right as multiplicand is shifted left
  - least significant bits of product never change once formed
• Intuition: instead of shifting multiplicand to left, shift product to right…
Shift-add Multiplier Version 2
Algorithm
Start
1. Test Multiplier0
   - Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
   - Multiplier0 = 0: go directly to step 2
2. Shift the Product register right 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition?
   - No: < 32 repetitions: go back to step 1
   - Yes: 32 repetitions: Done
Shift-add Multiplier Version 2
[Figure: Version 2 hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), 32-bit Multiplier register (shift right), and Control test]
• Multiplicand register, multiplier register, ALU are 32-bit wide; product register is 64-bit wide; multiplicand adds to left half of product register
• Product register is initialized at 0
Shift-add Multiplier Version 2
Example: 0010 * 0011:
Iteration  Step         Multiplier  Multiplicand  Product
0          init values  0011        0010          0000 0000
1          1a           0011        0010          0010 0000
           2            0011        0010          0001 0000
           3            0001        0010          0001 0000
2          …
Observations on Multiply Version 2
Each step the product register wastes space that exactly matches the current size of the multiplier
Intuition: combine multiplier register and product register…
Shift-add Multiplier Version 3
[Figure: Version 3 hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), and Control test]
• No separate multiplier register; multiplier placed on right side of 64-bit product register
• Product register is initialized with multiplier on right
Algorithm
Start
1. Test Product0
   - Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
   - Product0 = 0: go directly to step 2
2. Shift the Product register right 1 bit
32nd repetition?
   - No: < 32 repetitions: go back to step 1
   - Yes: 32 repetitions: Done
Shift-add Multiplier Version 3
Example: 0010 * 0011:
Iteration  Step         Multiplicand  Product
0          init values  0010          0000 0011
1          1a           0010          0010 0011
           2            0010          0001 0001
2          …
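Version 3 can be modelled in a few lines as well. In this sketch the Python integer keeps the adder's carry-out as an extra top bit until the following right shift, which is an assumption about how the hardware would hold that carry; `multiply_v3` and `width` are illustrative names, and operands are treated as unsigned.

```python
def multiply_v3(multiplicand, multiplier, width=32):
    """Shift-add multiply with the multiplier held in the right half of Product."""
    half = (1 << width) - 1
    mcand = multiplicand & half
    product = multiplier & half          # left half 0, multiplier on the right
    for _ in range(width):
        if product & 1:                  # 1. test Product0
            product += mcand << width    # 1a. add multiplicand to the left half
        product >>= 1                    # 2. shift Product right 1 bit
    return product
```

For `multiply_v3(0b0010, 0b0011, width=4)` the intermediate Product values follow the example table: 0010 0011 after step 1a and 0001 0001 after the first shift.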
Observations on Multiply Version 3
• 2 steps per bit because multiplier & product are combined
• What about signed multiplication?
  - easiest solution is to make both positive and remember whether to negate the product when done, i.e., leave out the sign bit, run for 31 steps, then negate if multiplier and multiplicand have opposite signs
  - Booth’s Algorithm is an elegant way to multiply signed numbers using the same hardware; it is also often quicker…
MIPS Notes
• MIPS provides two 32-bit registers Hi and Lo to hold a 64-bit product
• mult, multu (unsigned) put the product of two 32-bit register operands into Hi and Lo: overflow is ignored by MIPS but can be detected by the programmer by examining the contents of Hi
• mflo, mfhi move the contents of Lo or Hi to a general-purpose register
• Pseudo-instructions mul (without overflow), mulo (with overflow), mulou (unsigned with overflow) take three 32-bit register operands, putting the product of two registers into the third
Divide
                  1001     Quotient
Divisor 1000 ) 1001010     Dividend
              –1000
                  10
                  101
                  1010
                 –1000
                    10     Remainder
• standard method: see how big a multiple of the divisor can be subtracted, creating a quotient digit at each step
• Binary makes it easy: first, try 1 * divisor; if too big, 0 * divisor
• Dividend = (Quotient * Divisor) + Remainder
• 3 versions of divide hardware & algorithm:
Divide Version 1
Algorithm
Start
1. Subtract the Divisor register from the Remainder register and place the result in the Remainder register
Test Remainder
   - Remainder ≥ 0: 2a. Shift the Quotient register to the left, setting the new rightmost bit to 1
   - Remainder < 0: 2b. Restore the original value by adding the Divisor register to the Remainder register and place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0
3. Shift the Divisor register right 1 bit
33rd repetition?
   - No: < 33 repetitions: go back to step 1
   - Yes: 33 repetitions: Done

Example: 0111 / 0010:
Iteration  Step  Quotient  Divisor    Remainder
0          init  0000      0010 0000  0000 0111
1          1     0000      0010 0000  1110 0111
           2b    0000      0010 0000  0000 0111
           3     0000      0001 0000  0000 0111
2          …
3
4
5
Divide Version 1
[Figure: Version 1 hardware: 64-bit Divisor register (shift right), 64-bit ALU, 64-bit Remainder register (write), 32-bit Quotient register (shift left), and Control test]
• Divisor register, remainder register, ALU are 64-bit wide; quotient register is 32-bit wide
• 32-bit divisor starts at left half of divisor register
• Remainder register is initialized with the dividend at right
• Quotient register is initialized to be 0
• Why 33?
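The restoring-division steps of Version 1 can be sketched as follows, assuming unsigned operands. `divide_v1` and `width` are names made up for this illustration; Python's signed integers stand in for the sign test on the Remainder register.

```python
def divide_v1(dividend, divisor, width=32):
    """Restoring division with a double-width divisor register; width+1 repetitions."""
    if divisor == 0:
        raise ZeroDivisionError("software must check the divisor")
    remainder = dividend              # dividend in the right half of Remainder
    div = divisor << width            # divisor starts in the left half
    quotient = 0
    qmask = (1 << width) - 1          # Quotient register is only width bits wide
    for _ in range(width + 1):
        remainder -= div                                  # 1. subtract
        if remainder >= 0:
            quotient = ((quotient << 1) | 1) & qmask      # 2a. shift in a 1
        else:
            remainder += div                              # 2b. restore ...
            quotient = (quotient << 1) & qmask            #     ... and shift in a 0
        div >>= 1                                         # 3. shift Divisor right
    return quotient, remainder
```

With `width=4`, `divide_v1(0b0111, 0b0010, width=4)` runs the five repetitions of the example; the first repetition can only ever shift in a 0.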
Observations on Divide Version 1
• Half the bits in divisor always 0
  - 1/2 of 64-bit adder is wasted
  - 1/2 of divisor register is wasted
• Intuition: instead of shifting divisor to right, shift remainder to left…
• Step 1 cannot produce a 1 in the quotient bit, as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit)
• Intuition: switch order to shift first and then subtract; this can save 1 iteration…
Divide Version 2
[Figure: Version 2 hardware: 32-bit Divisor register, 32-bit ALU, 64-bit Remainder register (shift left, write), 32-bit Quotient register (shift left), and Control test]
• Divisor register, quotient register, ALU are 32-bit wide; remainder register is 64-bit wide
• Remainder register is initialized with the dividend at right
Algorithm
Start
1. Shift the Remainder register left 1 bit
2. Subtract the Divisor register from the left half of the Remainder register and place the result in the left half of the Remainder register
Test Remainder
   - Remainder ≥ 0: 3a. Shift the Remainder register to the left, setting the new rightmost bit to 0. Also shift the Quotient register to the left, setting the new rightmost bit to 1.
   - Remainder < 0: 3b. Restore the original value by adding the Divisor register to the left half of the Remainder register and place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new rightmost bit to 0. Also shift the Quotient register to the left, setting the new rightmost bit to 0.
32nd repetition?
   - No: < 32 repetitions: go back to step 2
   - Yes: 32 repetitions: Done. Shift left half of Remainder right 1 bit
Why this correction step?
Observations on Divide Version 2
Each step the remainder register wastes space that exactly matches the current size of the quotient
Intuition: combine quotient register and remainder register…
Divide Version 3
[Figure: Version 3 hardware: 32-bit Divisor register, 32-bit ALU, 64-bit Remainder register (shift left, shift right, write), and Control test]
• No separate quotient register; quotient is entered on the right side of the 64-bit remainder register
• Remainder register is initialized with the dividend at right
Algorithm
Start
1. Shift the Remainder register left 1 bit
2. Subtract the Divisor register from the left half of the Remainder register and place the result in the left half of the Remainder register
Test Remainder
   - Remainder ≥ 0: 3a. Shift the Remainder register to the left, setting the new rightmost bit to 1
   - Remainder < 0: 3b. Restore the original value by adding the Divisor register to the left half of the Remainder register and place the sum in the left half of the Remainder register. Also shift the Remainder register to the left, setting the new rightmost bit to 0
32nd repetition?
   - No: < 32 repetitions: go back to step 2
   - Yes: 32 repetitions: Done. Shift left half of Remainder right 1 bit
Why this correction step?
Divide Version 3
Example: 0111 / 0010:
Iteration  Step  Divisor  Remainder
0          init  0010     0000 0111
           1     0010     0000 1110
1          2     0010     1110 1110
           3b    0010     0001 1100
2          …
3
4
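A software sketch of Version 3 is given below, assuming unsigned operands. Python integers are unbounded, so the upper part of the Remainder may temporarily hold one extra bit, which is an assumption about how the hardware would keep that bit; `divide_v3` and `width` are hypothetical names. Rather than subtracting and then restoring, the sketch only commits the subtraction when the result is non-negative, which leaves the same register contents.

```python
def divide_v3(dividend, divisor, width=32):
    """Division with the quotient shifted into the right half of Remainder."""
    if divisor == 0:
        raise ZeroDivisionError("software must check the divisor")
    half = (1 << width) - 1
    rem = (dividend & half) << 1          # 1. shift Remainder left 1 bit (done once)
    for _ in range(width):
        left = (rem >> width) - divisor   # 2. subtract Divisor from the left half
        if left >= 0:
            rem = (left << width) | (rem & half)
            rem = (rem << 1) | 1          # 3a. shift left, new rightmost bit = 1
        else:
            rem = rem << 1                # 3b. left half restored, new rightmost bit = 0
    remainder = (rem >> width) >> 1       # Done: shift left half right 1 bit
    quotient = rem & half
    return quotient, remainder
```

The final right shift of the left half undoes the one extra left shift that the remainder has received by the end of the loop, which is the correction step the slide asks about.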
Observations on Divide Version 3
Same hardware as Multiply Version 3
• Signed divide:
  - make both divisor and dividend positive and perform division
  - negate the quotient if divisor and dividend were of opposite signs
  - make the sign of the remainder match that of the dividend
  - this ensures always
      dividend = (quotient * divisor) + remainder
      –quotient (x/y) = quotient (–x/y)   (e.g. 7 = 3*2 + 1 & –7 = –3*2 – 1)
MIPS Divide Instruction
• Divide generates the remainder in hi and the quotient in lo
      div $s0, $s1    # lo = $s0 / $s1
                      # hi = $s0 mod $s1
  [Figure: R-type instruction format with fields op, rs, rt, rd, shamt, funct]
• Instructions mfhi rd and mflo rd are provided to move the remainder and the quotient to (user accessible) registers in the register file
• As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0.
MIPS Notes
• div rs, rt (signed), divu rs, rt (unsigned), with two 32-bit register operands, divide the contents of the operands and put remainder in Hi register and quotient in Lo; overflow is ignored in both cases
• pseudo-instructions div rdest, rsrc1, src2 (signed with overflow), divu rdest, rsrc1, src2 (unsigned without overflow) with three 32-bit register operands put the quotient of two registers into the third
Floating Point
Review of Numbers
• Computers are made to deal with numbers
• What can we represent in N bits?
  - Unsigned integers: 0 to 2^N - 1
  - Signed Integers (Two’s Complement): -2^(N-1) to 2^(N-1) - 1
Other Numbers
• What about other numbers?
  - Very large numbers? (seconds/century) 3,155,760,000ten (3.15576ten x 10^9)
  - Very small numbers? (atomic diameter) 0.00000001ten (1.0ten x 10^-8)
  - Rationals (repeating pattern) 2/3 (0.666666666. . .)
  - Irrationals 2^(1/2) (1.414213562373. . .)
  - Transcendentals e (2.718...), π (3.141...)
• All represented in scientific notation
Scientific Notation (in Decimal)
    6.02ten x 10^23
    (mantissa: 6.02; radix (base): 10; exponent: 23; the "." is the decimal point)
• Normalized form: no leading 0s (exactly one digit to left of decimal point)
• Alternatives to representing 1/1,000,000,000
  - Normalized: 1.0 x 10^-9
  - Not normalized: 0.1 x 10^-8, 10.0 x 10^-10
Scientific Notation (in Binary)
    1.0two x 2^-1
    (mantissa: 1.0; radix (base): 2; exponent: -1; the "." is the “binary point”)
• Computer arithmetic that supports it is called floating point, because it represents numbers where the binary point is not fixed, as it is for integers
• Declare such a variable in C as float
Floating Point Representation (1/2)
• Normal format: +1.xxxxxxxxxxtwo * 2^yyyytwo
• Multiple of Word Size (32 bits)
    bit 31    bits 30-23    bits 22-0
    S         Exponent      Significand
    1 bit     8 bits        23 bits
• S represents Sign; Exponent represents y’s; Significand represents x’s
• Represent numbers as small as 2.0 x 10^-38 to as large as 2.0 x 10^38
Floating Point Representation (2/2)
• What if result too large? (> 2.0 x 10^38)
  - Overflow!
  - Overflow => Exponent larger than represented in 8-bit Exponent field
• What if result too small? (> 0, < 2.0 x 10^-38)
  - Underflow!
  - Underflow => Negative exponent larger than represented in 8-bit Exponent field
• How to reduce chances of overflow or underflow?
Double Precision Fl. Pt. Representation
• Next Multiple of Word Size (64 bits)
    bit 31    bits 30-20    bits 19-0
    S         Exponent      Significand
    1 bit     11 bits       20 bits
    Significand (cont’d)
    32 bits
• Double Precision (vs. Single Precision)
  - C variable declared as double
  - Represent numbers almost as small as 2.0 x 10^-308 to almost as large as 2.0 x 10^308
  - But primary advantage is greater accuracy due to larger significand
IEEE 754 Floating Point Standard (1/4)
• Single Precision, DP similar
• Sign bit: 1 means negative, 0 means positive
• Significand:
  - To pack more bits, leading 1 implicit for normalized numbers
  - 1 + 23 bits single, 1 + 52 bits double
  - always true: Significand < 1 (for normalized numbers)
• Note: 0 has no leading 1, so reserve exponent value 0 just for number 0
IEEE 754 Floating Point Standard (2/4)
• Want FP numbers to be used even if no FP hardware; e.g., sort records with FP numbers using integer compares
• Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands
• Then want order:
  - Highest order bit is sign (negative < positive)
  - Exponent next, so big exponent => bigger #
  - Significand last: exponents same => bigger #
IEEE 754 Floating Point Standard (3/4)
• Negative Exponent?
  - 2’s comp? 1.0 x 2^-1 v. 1.0 x 2^+1 (1/2 v. 2)
      1/2   0 1111 1111 000 0000 0000 0000 0000 0000
      2     0 0000 0001 000 0000 0000 0000 0000 0000
  - This notation using integer compare of 1/2 v. 2 makes 1/2 > 2!
• Instead, pick notation where 0000 0001 is most negative, and 1111 1111 is most positive
  - 1.0 x 2^-1 v. 1.0 x 2^+1 (1/2 v. 2)
      1/2   0 0111 1110 000 0000 0000 0000 0000 0000
      2     0 1000 0000 000 0000 0000 0000 0000 0000
IEEE 754 Floating Point Standard (4/4)
• Called Biased Notation, where bias is number subtracted to get real number
  - IEEE 754 uses bias of 127 for single prec.
  - Subtract 127 from Exponent field to get actual value for exponent
  - 1023 is bias for double precision
• Summary (single precision):
    bit 31    bits 30-23    bits 22-0
    S         Exponent      Significand
    1 bit     8 bits        23 bits
  - (-1)^S x (1 + Significand) x 2^(Exponent-127)
  - Double precision identical, except with exponent bias of 1023
• Floating Point numbers approximate values that we want to use.
• IEEE 754 Floating Point Standard is most widely accepted attempt to standardize interpretation of such numbers
  - Every desktop or server computer sold since ~1997 follows these conventions
Example: Converting Binary FP to Decimal
    0 0110 1000 101 0101 0100 0011 0100 0010
• Sign: 0 => positive
• Exponent:
  - 0110 1000two = 104ten
  - Bias adjustment: 104 - 127 = -23
• Significand:
  - 1 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5 + ...
    = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    = 1.0ten + 0.666115ten
• Represents: 1.666115ten * 2^-23 ~ 1.986 * 10^-7 (about 2/10,000,000)
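The decoding steps above can be replayed in Python. The sketch splits the 32-bit pattern into its three fields and applies the single-precision formula; it only handles normalized numbers (Exponent field 1 to 254). `decode_single` and `EXAMPLE_BITS` are names chosen for this illustration, and `struct` from the standard library is used only as a cross-check.

```python
import struct


def decode_single(bits):
    """Return (sign, exponent field, significand incl. hidden 1, value)."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    significand = 1 + fraction / 2**23                    # implicit leading 1
    value = (-1) ** sign * significand * 2.0 ** (exponent - 127)
    return sign, exponent, significand, value


# The bit pattern from the example: sign, 8 exponent bits, 23 significand bits.
EXAMPLE_BITS = int("0" "01101000" "10101010100001101000010", 2)

if __name__ == "__main__":
    sign, exponent, significand, value = decode_single(EXAMPLE_BITS)
    print(sign, exponent, significand, value)
    # Cross-check against the machine's own interpretation of the same bits.
    print(struct.unpack(">f", struct.pack(">I", EXAMPLE_BITS))[0])
```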
Converting Decimal to FP (1/3)
• Simple Case: If denominator is a power of 2 (2, 4, 8, 16, etc.), then it’s easy.
• Show MIPS representation of -0.75
  - -0.75 = -3/4
  - -11two/100two = -0.11two
  - Normalized to -1.1two x 2^-1
  - (-1)^S x (1 + Significand) x 2^(Exponent-127)
  - (-1)^1 x (1 + .100 0000 ... 0000) x 2^(126-127)
    1 0111 1110 100 0000 0000 0000 0000 0000
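The fields derived by hand for -0.75 can be compared with what the machine produces. This small sketch uses Python's standard `struct` module to obtain the single-precision bit pattern; `single_fields` is a hypothetical helper name.

```python
import struct


def single_fields(x):
    """Return (sign, exponent field, significand field) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


if __name__ == "__main__":
    sign, exponent, fraction = single_fields(-0.75)
    print(sign, format(exponent, "08b"), format(fraction, "023b"))
```

Because -0.75 has a power-of-2 denominator, the conversion is exact and no rounding is involved.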
Converting Decimal to FP (2/3)
• Not So Simple Case: If denominator is not a power of 2.
  - Then we can’t represent number precisely, but that’s why we have so many bits in significand: for precision
  - Once we have significand, normalizing a number to get the exponent is easy.
  - So how do we get the significand of a never-ending number?
Converting Decimal to FP (3/3)
• Fact: All rational numbers have a repeating pattern when written out in decimal.
• Fact: This still applies in binary.
• To finish conversion:
  - Write out binary number with repeating pattern.
  - Cut it off after correct number of bits (different for single v. double precision).
  - Derive Sign, Exponent and Significand fields.
Example: Representing 1/3 in MIPS
• 1/3 = 0.33333…ten
      = 0.25 + 0.0625 + 0.015625 + 0.00390625 + …
      = 1/4 + 1/16 + 1/64 + 1/256 + …
      = 2^-2 + 2^-4 + 2^-6 + 2^-8 + …
      = 0.0101010101…two * 2^0
      = 1.0101010101…two * 2^-2
• Sign: 0
• Exponent = -2 + 127 = 125 = 01111101
• Significand = 0101010101…
    0 0111 1101 0101 0101 0101 0101 0101 010
Representation for ± ∞
• In FP, divide by 0 should produce ± ∞, not overflow.
• Why?
  - OK to do further computations with ∞. E.g., X/0 > Y may be a valid comparison
• IEEE 754 represents ± ∞
  - Most positive exponent reserved for ∞
  - Significands all zeroes
Representation for 0
• Represent 0?
  - exponent all zeroes
  - significand all zeroes too
  - What about sign?
  - +0: 0 00000000 00000000000000000000000
  - -0: 1 00000000 00000000000000000000000
• Why two zeroes?
  - Helps in some limit comparisons
Special Numbers
• What have we defined so far? (Single Precision)
Exponent Significand Object
0 0 0
0 nonzero ???
1-254 anything +/- fl. pt. #
255 0 +/- ∞
255 nonzero ???
Representation for Not a Number
• What is sqrt(-4.0) or 0/0?
  - If ∞ not an error, these shouldn’t be either.
  - Called Not a Number (NaN)
  - Exponent = 255, Significand nonzero
• Why is this useful?
  - Hope NaNs help with debugging?
  - They contaminate: op(NaN, X) = NaN
Floating Point Addition Algorithm:
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
Overflow or underflow?
   - Yes: Exception
   - No: continue with step 4
4. Round the significand to the appropriate number of bits
Still normalized?
   - No: go back to step 3
   - Yes: Done
Floating Point Addition Hardware:
[Figure: FP adder datapath. A small ALU computes the exponent difference (compare exponents); control and multiplexors pick the smaller number, whose significand is shifted right; the big ALU adds the significands; the sum is normalized by shifting left or right while the exponent is incremented or decremented; rounding hardware produces the final sign, exponent and significand]
Floating Point Multiplication Algorithm:
Start
1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent
2. Multiply the significands
3. Normalize the product if necessary, shifting it right and incrementing the exponent
Overflow or underflow?
   - Yes: Exception
   - No: continue with step 4
4. Round the significand to the appropriate number of bits
Still normalized?
   - No: go back to step 3
   - Yes: continue with step 5
5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ make the sign negative
Done
MIPS Floating Point Architecture (1/4)
• Separate floating point instructions:
  - Single Precision: add.s, sub.s, mul.s, div.s
  - Double Precision: add.d, sub.d, mul.d, div.d
• These are far more complicated than their integer counterparts
  - Can take much longer to execute
MIPS Floating Point Architecture (2/4)
• Problems:
  - Inefficient to have different instructions take vastly differing amounts of time.
  - Generally, a particular piece of data will not change between FP and int within a program.
    - Only 1 type of instruction will be used on it.
  - Some programs do no FP calculations
  - It takes lots of hardware relative to integers to do FP fast
MIPS Floating Point Architecture (3/4)
• 1990 Solution: Make a completely separate chip that handles only FP.
• Coprocessor 1: FP chip
• contains 32 32-bit registers: $f0, $f1, …
• most of the registers specified in .s and .d instructions refer to this set
• separate load and store: lwc1 and swc1 (“load word coprocessor 1”, “store …”)
• Double Precision: by convention, an even/odd pair contains one DP FP number: $f0/$f1, $f2/$f3, … , $f30/$f31
- Even register is the name
MIPS Floating Point Architecture (4/4)
• Today, FP coprocessor integrated with CPU, or cheap chips may leave out FP HW
• Instructions to move data between main processor and coprocessors:
• mfc0, mtc0, mfc1, mtc1, etc.
• Appendix contains many more FP ops
Performance evaluation
Performance Metrics
Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best cost/performance?
Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best cost/performance?
Both require
- a basis for comparison
- a metric for evaluation
Two Notions of “Performance”
Plane              Top Speed   DC to Paris   Passengers   Throughput (pass x mph)
Boeing 747         610 mph     6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph    3 hours       132          178,200
• Which has higher performance?
• Time to deliver 1 passenger?
• Time to deliver 400 passengers?
• In a computer, time for 1 job is called Response Time or Execution Time
• In a computer, jobs per day is called Throughput or Bandwidth
• We will focus primarily on execution time for a single job
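The two notions of performance can be recomputed from the table's own figures. This is a small Python sketch; the dictionary layout and the helper name `throughput` are illustrative choices.

```python
# Figures from the table above: (top speed in mph, DC-to-Paris hours, passengers).
planes = {
    "Boeing 747": (610, 6.5, 470),
    "Concorde": (1350, 3.0, 132),
}

def throughput(speed_mph, passengers):
    """Throughput as defined in the table: passengers x mph."""
    return speed_mph * passengers

for name, (speed, hours, passengers) in planes.items():
    print(f"{name}: {hours} h per trip, {throughput(speed, passengers):,} passenger-mph")

# Lowest response time (one trip) versus highest throughput.
fastest = min(planes, key=lambda n: planes[n][1])
most_throughput = max(planes, key=lambda n: throughput(planes[n][0], planes[n][2]))
print(fastest, most_throughput)
```

The two metrics pick different winners: the Concorde has the better response time, while the 747 has the better throughput.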
Performance Metrics
Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors
Additional Metrics
- Smallest/lightest
- Longest battery life
- Most reliable/durable (in space)
What is Time?
Straightforward definition of time:
- Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, ...
- “real time”, “response time” or “elapsed time”
Alternative: just the time the processor (CPU) is working only on your program (since multiple processes are running at the same time)
- “CPU execution time” or “CPU time”
- Often divided into system CPU time (in OS) and user CPU time (in user program)
Performance Factors
Want to distinguish elapsed time and the time spent on our task
CPU execution time (CPU time) – time the CPU spends working on a task
Does not include time waiting for I/O or running other programs

$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$

or

$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
Many techniques that decrease the number of clock cycles also increase the clock cycle time
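Both forms of the CPU time relation give the same result, since the clock rate is the inverse of the clock cycle time. A minimal Python sketch follows; the function names and the cycle count are hypothetical and chosen only for illustration.

```python
def cpu_time_from_cycle_time(clock_cycles, clock_cycle_time_s):
    """CPU time = clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time_s

def cpu_time_from_clock_rate(clock_cycles, clock_rate_hz):
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz

# Hypothetical program: 10 billion cycles on a 2 GHz clock (500 ps cycle time).
cycles = 10_000_000_000
print(cpu_time_from_clock_rate(cycles, 2e9))
print(cpu_time_from_cycle_time(cycles, 500e-12))
```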
Review: Machine Clock Rate
Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)
CC = 1 / CR
[Figure: clock waveform marking one clock period]
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
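The conversions listed above follow directly from CC = 1 / CR. A short Python sketch (the function name `clock_rate_hz` is an illustrative choice) reproduces them:

```python
def clock_rate_hz(cycle_time_s):
    """Clock rate is the inverse of the clock cycle time."""
    return 1.0 / cycle_time_s

# Cycle times from the list above, in seconds.
periods = {
    "10 nsec": 10e-9,
    "5 nsec": 5e-9,
    "2 nsec": 2e-9,
    "1 nsec": 1e-9,
    "500 psec": 500e-12,
    "250 psec": 250e-12,
    "200 psec": 200e-12,
}

for label, period in periods.items():
    print(f"{label} clock cycle => {clock_rate_hz(period) / 1e6:.0f} MHz clock rate")
```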
Clock Cycles per Instruction
Not all instructions take the same amount of time to execute
One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction
Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA

$$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$

CPI for this instruction class
        A   B   C
CPI     1   2   3
Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging
$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

Where $\text{IC}_i$ is the count (percentage) of the number of instructions of class i executed
$\text{CPI}_i$ is the (average) number of clock cycles per instruction for that instruction class
n is the number of instruction classes
The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
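The weighted sum can be sketched in Python as follows. The CPI values for classes A, B and C are the ones shown earlier (1, 2, 3); the instruction counts are hypothetical, and the function divides by the total count so that either raw counts or fractions can be passed in.

```python
def effective_cpi(classes):
    """classes maps a class name to (CPI_i, IC_i); IC_i may be a count or a fraction."""
    total_ic = sum(ic for _, ic in classes.values())
    return sum(cpi * ic for cpi, ic in classes.values()) / total_ic

# CPI per class from the slides; instruction counts are made up for illustration.
mix = {"A": (1, 2_000), "B": (2, 1_000), "C": (3, 2_000)}
print(effective_cpi(mix))

# A different instruction mix on the same machine gives a different effective CPI.
other_mix = {"A": (1, 4_000), "B": (2, 1_000), "C": (3, 0)}
print(effective_cpi(other_mix))
```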
THE Performance Equation
Our basic performance equation is then

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

or

$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$

These equations separate the three key factors that affect performance
- Can measure the CPU execution time by running the program
- The clock rate is usually given
- Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
Note: instruction count != the number of lines in the code
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                         Instruction_count   CPI   clock_cycle
Algorithm                X                   X
Programming language     X                   X
Compiler                 X                   X
ISA                      X                   X     X
Processor organization                       X     X
Technology                                         X
A Simple Example

Op       Freq   CPIi   Freq x CPIi
ALU      50%    1      .5
Load     20%    5      1.0
Store    10%    3      .3
Branch   20%    2      .4
                Σ =    2.2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
Freq x CPIi becomes .5, .4, .3, .4, for an effective CPI of 1.6.
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to shave a cycle off the branch time?
Freq x CPIi becomes .5, 1.0, .3, .2, for an effective CPI of 2.0.
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
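The two what-if calculations can be reproduced with a few lines of Python. The frequencies and CPI values are those of the example; the helper names `effective_cpi` and `speedup_with` are illustrative. Instruction count and clock cycle time are unchanged in both scenarios, so the speedup is just the ratio of effective CPIs.

```python
# (frequency, CPI) per operation class, taken from the example above.
base = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

def effective_cpi(mix):
    return sum(freq * cpi for freq, cpi in mix.values())

def speedup_with(mix, op, new_cpi):
    """Speedup when only the CPI of one operation class changes."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return effective_cpi(mix) / effective_cpi(changed)

print(f"base CPI: {effective_cpi(base):.1f}")
print(f"better data cache: {speedup_with(base, 'Load', 2):.3f}x")
print(f"branch prediction: {speedup_with(base, 'Branch', 1):.3f}x")
```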
Comparing and Summarizing Performance
Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
How do we summarize the performance for a benchmark set with a single number?
The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

Where $\text{Time}_i$ is the execution time for the ith program of a total of n programs in the workload
A smaller mean indicates a smaller average execution time and thus improved performance
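As a quick illustration of the arithmetic mean, here is a Python sketch using hypothetical execution times for a three-program workload; the numbers are invented and are not SPEC results.

```python
def arithmetic_mean(times):
    """AM = (1/n) * sum of the execution times."""
    return sum(times) / len(times)

# Hypothetical execution times in seconds for the same workload on two machines.
machine_x = [10.0, 20.0, 30.0]
machine_y = [8.0, 20.0, 26.0]

print(arithmetic_mean(machine_x), arithmetic_mean(machine_y))
# The AM is directly proportional to total execution time: total = n * AM.
print(sum(machine_y) == len(machine_y) * arithmetic_mean(machine_y))
```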
SPEC Benchmarks www.spec.org

Integer benchmarks
gzip       compression
vpr        FPGA place & route
gcc        GNU C compiler
mcf        Combinatorial optimization
crafty     Chess program
parser     Word processing program
eon        Computer visualization
perlbmk    perl application
gap        Group theory interpreter
vortex     Object oriented database
bzip2      compression
twolf      Circuit place & route

FP benchmarks
wupwise    Quantum chromodynamics
swim       Shallow water model
mgrid      Multigrid solver in 3D fields
applu      Parabolic/elliptic pde
mesa       3D graphics library
galgel     Computational fluid dynamics
art        Image recognition (NN)
equake     Seismic wave propagation simulation
facerec    Facial image recognition
ammp       Computational chemistry
lucas      Primality testing
fma3d      Crash simulation fem
sixtrack   Nuclear physics accel
apsi       Pollutant distribution
Example SPEC Ratings
Other Performance Metrics
Power consumption – especially in the embedded market where battery life is important (and passive cooling)
For power-limited applications, the most important metric is energy efficiency
Summary: Evaluating ISAs
Design-time metrics:
- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?
Static Metrics:
- How many bytes does the program occupy in memory?
Dynamic Metrics:
- How many instructions are executed?
- How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
Best Metric: Time to execute the program!
Time to execute the program (Inst. Count x CPI x Cycle Time) depends on the instruction set, the processor organization, and compilation techniques.
Example: Problem 5 - Performance
A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix:

Type   CPI   Freq
A      4     20%
B      3     30%
C      1     40%
D      2     10%

What is the average CPI for this session?
Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3.
How much CPU time was used in this session?
Answer: (200 x 10^6 instructions) · (2.3 cycles/instruction) · (2 ns/cycle) = 0.92 seconds.
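Both answers can be checked with the performance equation. The sketch below uses only the numbers given in the problem; the variable names are illustrative.

```python
clock_rate_hz = 500e6        # 500 MHz, i.e. 2 ns per cycle
instruction_count = 200e6    # 200 million instructions

# (CPI, frequency) for each instruction type in the session.
mix = {"A": (4, 0.20), "B": (3, 0.30), "C": (1, 0.40), "D": (2, 0.10)}

# Average CPI is the frequency-weighted sum of the per-type CPIs.
average_cpi = sum(cpi * freq for cpi, freq in mix.values())

# CPU time = Instruction_count x CPI / clock_rate
cpu_time_s = instruction_count * average_cpi / clock_rate_hz

print(f"average CPI = {average_cpi:.1f}, CPU time = {cpu_time_s:.2f} s")
```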