14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 7

331 W07.1 Fall 2003


[Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]


Heads Up

This week’s material

- MIPS logic and multiply instructions: reading assignment PH 4.4

- MIPS ALU design: reading assignment PH 4.5

Next week’s material

- Building a MIPS datapath: reading assignment PH 5.1-5.2

Review: MIPS Arithmetic Instructions

Instruction formats (bit positions marked 31, 25, 20, 15, 5, 0):

R-type:  op  Rs  Rt  Rd  funct
I-type:  op  Rs  Rt  Immed 16

Type   op   funct
ADD    00   100000
ADDU   00   100001
SUB    00   100010
SUBU   00   100011
AND    00   100100
OR     00   100101
XOR    00   100110
NOR    00   100111

Type   op   funct
       00   101000
       00   101001
SLT    00   101010
SLTU   00   101011
       00   101100

ALU operation codes:

0  add
1  addu
2  sub
3  subu
4  and
5  or
6  xor
7  nor
a  slt
b  sltu

Expand immediates to 32 bits before the ALU

10 operations, so the operation can be encoded in 4 bits

[Figure: ALU with 32-bit inputs A and B, a 4-bit operation input m, a 32-bit result output, and zero and ovf outputs]

Review: A 32-bit Adder/Subtractor

Built out of 32 full adders (FAs)

[Figure: chain of 32 1-bit FAs; FA i takes A_i, B_i and carry c_i and produces S_i and c_(i+1), with c0 = carry_in, c32 = carry_out, and a shared add/subt control]

1-bit FA (inputs A, B, carry_in; outputs S, carry_out):

S = A xor B xor carry_in

carry_out = A•B v A•carry_in v B•carry_in (majority function)

Small but slow!
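The two full-adder equations can be chained into a ripple-carry adder/subtractor. The following is a minimal Python sketch, not the slide's circuit: the function names are made up for illustration, and it assumes subtraction is done as A + (not B) + 1, with the add/subt line inverting each B bit and also driving c0.

```python
def full_adder(a, b, carry_in):
    """1-bit full adder: returns (S, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)  # majority function
    return s, carry_out


def ripple_add_sub(a, b, subtract=0, width=32):
    """Add (subtract=0) or subtract (subtract=1) two width-bit patterns.

    Assumption: add/subt inverts every B bit and is also fed in as c0,
    so subtraction is A + (not B) + 1. Returns (result, carry_out).
    """
    carry = subtract            # c0 = carry_in
    result = 0
    for i in range(width):      # the carry ripples from bit 0 upward
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract
        s_i, carry = full_adder(a_i, b_i, carry)
        result |= s_i << i
    return result, carry
```

The loop makes the "slow" part visible: bit i cannot finish until the carry from bit i-1 has arrived.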

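Minimal Implementation of a Full Adder

library ieee;
use ieee.std_logic_1164.all;

-- Entity assumed from the signal names used below; the slide shows only the architecture.
entity full_adder is
  port (A, B, cin : in  std_logic;
        S, cout   : out std_logic);
end full_adder;

architecture concurrent_behavior of full_adder is
  signal t1, t2, t3, t4, t5: std_logic;
begin
  t1   <= not A after 1 ns;                            -- inverter
  t2   <= not cin after 1 ns;                          -- inverter
  t4   <= not((A or cin) and B) after 2 ns;            -- or-and-inverter
  t3   <= not((t1 or t2) and (A or cin)) after 2 ns;   -- or-and-inverter
  t5   <= t3 nand B after 2 ns;                        -- 2-input nand
  S    <= not((B or t3) and t5) after 2 ns;            -- or-and-inverter
  cout <= not((t1 or t2) and t4) after 2 ns;           -- or-and-inverter
end concurrent_behavior;

Can you create the equivalent schematic? Can you determine the worst-case delay (the worst-case timing path through the circuit)?

Gate library: inverters, 2-input nands, or-and-inverters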
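One way to check the gate network is to evaluate it for all eight input combinations and to trace the longest path through the "after" delays. The sketch below is an illustration only: it reads the cout assignment as not((t1 or t2) and t4), the helper names are hypothetical, and the delay figure is a simple longest-path estimate that ignores false paths.

```python
from itertools import product


def inv(x):
    """Logical NOT on a 0/1 value."""
    return 1 - x


def full_adder_gates(a, b, cin):
    """Evaluate the architecture's signal assignments on 0/1 values."""
    t1 = inv(a)
    t2 = inv(cin)
    t4 = inv((a | cin) & b)
    t3 = inv((t1 | t2) & (a | cin))
    t5 = inv(t3 & b)
    s = inv((b | t3) & t5)
    cout = inv((t1 | t2) & t4)
    return s, cout


# Gate delay in ns and driving signals, read off the "after" clauses.
GATES = {
    "t1": (1, ["A"]),
    "t2": (1, ["cin"]),
    "t4": (2, ["A", "cin", "B"]),
    "t3": (2, ["t1", "t2", "A", "cin"]),
    "t5": (2, ["t3", "B"]),
    "S": (2, ["B", "t3", "t5"]),
    "cout": (2, ["t1", "t2", "t4"]),
}


def arrival_time(signal):
    """Longest path delay in ns from any primary input to the signal."""
    if signal not in GATES:     # primary inputs A, B, cin arrive at time 0
        return 0
    delay, inputs = GATES[signal]
    return delay + max(arrival_time(i) for i in inputs)


def matches_full_adder():
    """True if the network equals S = A xor B xor cin and the majority carry."""
    return all(
        full_adder_gates(a, b, c) == (a ^ b ^ c, (a & b) | (a & c) | (b & c))
        for a, b, c in product((0, 1), repeat=3)
    )
```

Calling `arrival_time("S")` and `arrival_time("cout")` gives the longest structural path to each output under these assumptions.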

Logic Operations

Logic operations operate on individual bits of the operands.

$t2 = 0…0 0000 1101 0000

$t1 = 0…0 0011 1100 0000

and $t0, $t1, $t2    $t0 =

or  $t0, $t1, $t2    $t0 =

xor $t0, $t1, $t2    $t0 =

nor $t0, $t1, $t2    $t0 =

How do we expand our FA design to handle the logic operations (and, or, xor, nor)?
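The four bitwise instructions can be tried out on the $t1 and $t2 values above. This is a small Python sketch with hypothetical helper names; it assumes 32-bit registers, so nor is masked to 32 bits.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit MIPS registers


def mips_and(a, b):
    return a & b & MASK32


def mips_or(a, b):
    return (a | b) & MASK32


def mips_xor(a, b):
    return (a ^ b) & MASK32


def mips_nor(a, b):
    # Python integers are unbounded, so mask the complement to 32 bits.
    return ~(a | b) & MASK32


t2 = 0b0000_1101_0000
t1 = 0b0011_1100_0000

for name, op in [("and", mips_and), ("or", mips_or),
                 ("xor", mips_xor), ("nor", mips_nor)]:
    print(f"{name:4s} $t0 = {op(t1, t2):032b}")
```

Each output bit depends only on the two input bits in the same position, which is why no carry chain is involved.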

A Simple ALU Cell

[Figure: 1-bit ALU cell built around a 1-bit FA; inputs A, B, carry_in, add/subt and op; outputs result and carry_out]

An Alternative ALU Cell

[Figure: 1-bit ALU cell built around a 1-bit FA; inputs A, B, carry_in and select lines s0, s1, s2; outputs result and carry_out]

The Alternative ALU Cell’s Control Codes

s2  s1  s0  c_in  result      function
0   0   0   0     A           transfer A
0   0   0   1     A + 1       increment A
0   0   1   0     A + B       add
0   0   1   1     A + B + 1   add with carry
0   1   0   0     A - B - 1   subt with borrow
0   1   0   1     A - B       subtract
0   1   1   0     A - 1       decrement A
0   1   1   1     A           transfer A
1   0   0   x     A or B      or
1   0   1   x     A xor B     xor
1   1   0   x     A and B     and
1   1   1   x     !A          complement A
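The control-code table can be modeled in a few lines of Python. The sketch below is one possible reading, not the slide's circuit: it assumes the arithmetic rows all compute A + Y + c_in, where Y is 0, B, not B, or all ones depending on s1 s0. The function name and the 4-bit default width are illustrative choices.

```python
def alt_alu_cell(s2, s1, s0, c_in, a, b, width=4):
    """Return the table's result for operands a and b (width-bit patterns)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    if s2 == 0:
        # Arithmetic rows: adder input Y is chosen by s1 s0 (assumed structure).
        y = {
            (0, 0): 0,           # transfer / increment
            (0, 1): b,           # add / add with carry
            (1, 0): ~b & mask,   # subtract: one's complement of B
            (1, 1): mask,        # all ones: decrement / transfer
        }[(s1, s0)]
        return (a + y + c_in) & mask
    # Logic rows: c_in is a don't-care (x in the table).
    return {
        (0, 0): a | b,
        (0, 1): a ^ b,
        (1, 0): a & b,
        (1, 1): ~a & mask,
    }[(s1, s0)]
```

For example, `alt_alu_cell(0, 1, 0, 1, 6, 3)` follows the "subtract" row and `alt_alu_cell(1, 0, 1, 0, 6, 3)` follows the "xor" row.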

Tailoring the ALU to the MIPS ISA

Need to support the set-on-less-than instruction (slt)

- remember: slt is an arithmetic instruction

- produces a 1 if rs < rt and 0 otherwise

- use subtraction: (a - b) < 0 implies a < b

Need to support test for equality (beq)

- use subtraction: (a - b) = 0 implies a = b

Need to add the overflow detection hardware

Modifying the ALU Cell for slt

[Figure: 1-bit ALU cell built around a 1-bit FA; inputs A, B, carry_in, add/subt, op and the added less input; outputs result and carry_out]

Modifying the ALU for slt

[Figure: ALU cells for bits 0, 1, ..., 31; cell i has inputs A_i, B_i and less and output result_i]

First perform a subtraction

Make the result 1 if the subtraction yields a negative result

Make the result 0 if the subtraction yields a positive or zero result
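The subtract-then-test-the-sign idea can be sketched in Python as below. The function name is hypothetical, operands are passed as two's complement bit patterns, and the sketch deliberately follows the simple rule stated here: it does not correct for overflow in the subtraction, so it is only reliable when a - b fits in the word.

```python
def slt_by_subtract(a, b, width=32):
    """Set on less than via the sign bit of a - b.

    a and b are width-bit two's complement patterns. Caveat: overflow in
    the subtraction is ignored, as in the simple rule on the slide.
    """
    mask = (1 << width) - 1
    diff = (a - b) & mask               # what the adder/subtractor produces
    sign = (diff >> (width - 1)) & 1    # most significant bit of the result
    return sign                         # 1 if negative, 0 if positive or zero


MINUS_TWO = (-2) & 0xFFFFFFFF           # bit pattern of -2 in 32 bits
print(slt_by_subtract(3, 5), slt_by_subtract(5, 3), slt_by_subtract(MINUS_TWO, 1))
```

In the hardware picture, the sign bit computed in the most significant cell is what gets routed back to become bit 0 of the result.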

Modifying the ALU for Zero

[Figure: ALU cells for bits 0, 1, ..., 31 with inputs A_i, B_i and less, outputs result_i, shared add/subt and op controls, and the labels 0, 0 and set on the less wiring]

First perform subtraction

Insert additional logic to detect when all result bits are zero

Review: Overflow Detection

Overflow: the result is too large to represent in the number of bits allocated

Overflow occurs when

- adding two positives yields a negative, or

- adding two negatives gives a positive, or

- subtracting a negative from a positive gives a negative, or

- subtracting a positive from a negative gives a positive

On your own: Prove you can detect overflow by: Carry into MSB xor Carry out of MSB

    0 1 1 1      7
  + 0 0 1 1      3
  ---------
    1 0 1 0    - 6     (carry into MSB = 1, carry out of MSB = 0)

    1 1 0 0    - 4
  + 1 0 1 1    - 5
  ---------
    0 1 1 1      7     (carry into MSB = 0, carry out of MSB = 1)
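The carry-in xor carry-out rule can be checked exhaustively for 4-bit operands, including the 7 + 3 and -4 + -5 examples. This is a Python sketch with illustrative names, not a proof; it compares the rule against the definition "the true sum does not fit in the word".

```python
def add_with_overflow(a, b, width=4):
    """Add two width-bit patterns; returns (result, overflow).

    overflow = carry into MSB xor carry out of MSB.
    """
    mask = (1 << width) - 1
    low_mask = mask >> 1                      # all bits below the MSB
    carry_into_msb = ((a & low_mask) + (b & low_mask)) >> (width - 1)
    total = (a & mask) + (b & mask)
    carry_out_of_msb = (total >> width) & 1
    return total & mask, carry_into_msb ^ carry_out_of_msb


def to_signed(x, width=4):
    """Interpret a width-bit pattern as a two's complement number."""
    return x - (1 << width) if (x >> (width - 1)) & 1 else x


def rule_holds_for_all_4bit_pairs():
    """Compare the xor rule with 'true sum out of range' for every pair."""
    for a in range(16):
        for b in range(16):
            _, ovf = add_with_overflow(a, b)
            true_sum = to_signed(a) + to_signed(b)
            if ovf != int(not -8 <= true_sum <= 7):
                return False
    return True
```

An exhaustive check over 256 pairs is not the requested proof, but it is a quick sanity test before attempting one.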

Modifying the ALU for Overflow

[Figure: ALU cells for bits 0, 1, ..., 31 with inputs A_i, B_i and less, outputs result_i, shared add/subt and op controls, the set signal, and zero and overflow outputs]

Modify the most significant cell to determine overflow output setting

Disable overflow bit setting for unsigned arithmetic

Example:

[Figure: 4-bit ALU built from cells for bits 0 to 3 (inputs A0-A3, B0-B3 and less; outputs result0-result3) with op and add/subt controls and set, zero and overflow signals; the diagram is annotated with the values +2, +4, +6 and +8]

When do the result outputs settle at their final values for the inputs:

add/subt = 0, op = 000, A = 1111, B = 0001

Example: cont’d

[Figure: 4-bit ALU built from cells for bits 0 to 3 (inputs A0-A3, B0-B3 and less; outputs result0-result3) with op and add/subt controls and set, zero and overflow signals; the diagram is annotated with the values +2, +4, +6 and +8]

add/subt = 0, op = 100, A = 1111, B = 0001

Example: cont’d

[Figure: 4-bit ALU built from cells for bits 0 to 3 (inputs A0-A3, B0-B3 and less; outputs result0-result3) with op and add/subt controls and set, zero and overflow signals; the diagram is annotated with the values +2, +4, +6 and +8]

add/subt = 1, op = 101, A = 1111, B = 0001

What is the zero output for these inputs?

Example: cont’d

[Figure: 4-bit ALU built from cells for bits 0 to 3 (inputs A0-A3, B0-B3 and less; outputs result0-result3) with op and add/subt controls and set, zero and overflow signals; the diagram is annotated with the values +2, +4, +6 and +8]

With the ALU design described in class, we assumed that a subtraction operation had to be performed as part of the beq instruction. When do the outputs settle?

Is there a faster alternative?
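One candidate alternative, offered here only as a possibility and not as the answer given in class, is to compare the operands bit by bit with xor and then check that every difference bit is zero, which avoids waiting for a carry chain. The Python sketch below (hypothetical function names) shows that both approaches agree on the equality result.

```python
def equal_by_subtract(a, b, width=32):
    """beq test via the ALU: subtract, then detect an all-zero result."""
    mask = (1 << width) - 1
    diff = (a - b) & mask
    return int(diff == 0)           # the zero output


def equal_by_xor(a, b, width=32):
    """beq test without a carry chain: bitwise xor, then all-zero detect."""
    mask = (1 << width) - 1
    differing_bits = (a ^ b) & mask  # bit i is 1 exactly where a and b differ
    return int(differing_bits == 0)


def methods_agree(width=4):
    """Exhaustively compare the two tests for all width-bit operand pairs."""
    return all(
        equal_by_subtract(a, b, width) == equal_by_xor(a, b, width)
        for a in range(1 << width)
        for b in range(1 << width)
    )
```

Whether the xor form is actually faster depends on the gate delays and on how the all-zero detection is built, so the timing question still has to be worked out on the circuit.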

But What about Performance?

Critical path of n-bit ripple-carry adder is n*CP

Design trick: throw hardware at it (Carry Lookahead)

[Figure: four 1-bit ALUs in a chain; ALU i has inputs A_i, B_i and CarryIn_i and outputs Result_i and CarryOut_i, for i = 0 to 3]

Fast carry using “infinite” hardware (Parallel)

cout = b•cin + a•cin + a•b

c1 = (b0+a0)•c0 + a0•b0 = a0•b0 + a0•c0 + b0•c0

c2 = (b1+a1)•c1 + a1•b1
   = (b1+a1)•((b0+a0)•c0 + a0•b0) + a1•b1
   = a1•a0•b0 + a1•a0•c0 + b1•a0•c0 + b1•a0•b0 + a1•b0•c0 + b1•b0•c0 + b1•a1

c3 = a2•a1•a0•b0 + a2•a1•a0•c0 + a2•b1•a0•c0 + a2•b1•a0•b0 + a2•a1•b0•c0 + a2•b1•b0•c0 + a2•b1•a1 + …

…

Outputs settle much faster

- D_c3 = 2*D_and + D_or (best case)

- …

- D_c31 = 5*D_and + D_or (best case)

Problem: Prohibitively expensive
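The flattened c2 expression can be checked against the recurrence, and the cost can be made concrete by counting product terms. The Python sketch below uses hypothetical function names; the term count assumes the recurrence is multiplied out without any logic minimization, so it is an upper-bound style illustration rather than a gate count for a real design.

```python
from itertools import product


def c2_recurrence(a0, a1, b0, b1, c0):
    """c2 computed step by step from c1, as in the first two lines."""
    c1 = ((b0 | a0) & c0) | (a0 & b0)
    return ((b1 | a1) & c1) | (a1 & b1)


def c2_flat(a0, a1, b0, b1, c0):
    """c2 as the flattened sum of seven product terms."""
    return ((a1 & a0 & b0) | (a1 & a0 & c0) | (b1 & a0 & c0) | (b1 & a0 & b0)
            | (a1 & b0 & c0) | (b1 & b0 & c0) | (b1 & a1))


def flat_term_count(i):
    """Product terms in c_i when multiplied out without simplification."""
    terms = 1                    # c0 is a single term
    for _ in range(i):
        terms = 2 * terms + 1    # (b+a)•c doubles the terms, a•b adds one
    return terms


def flat_matches_recurrence():
    return all(c2_flat(*bits) == c2_recurrence(*bits)
               for bits in product((0, 1), repeat=5))
```

`flat_term_count(2)` matches the seven terms written out for c2, and the count roughly doubles with every further bit, which is the sense in which the fully parallel form becomes too expensive.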

Hierarchical Solution I

- Group 32 bits into 8 4-bit groups

- Within each group, use carry look ahead

- Use the 4-bit group as a building block, and connect the groups in ripple-carry fashion

First Level: Propagate and generate

c_(i+1) = (a_i•b_i) + (a_i+b_i)•c_i

g_i = a_i•b_i

p_i = (a_i+b_i)

c_(i+1) = g_i + p_i•c_i

c_(i+1) = 1 if g_i = 1, or if p_i = 1 and c_i = 1

c1 = g0+(p0•c0)

c2 = g1+(p1•g0)+(p1•p0•c0)

c3 = g2+(p2•g1)+(p2•p1•g0)+(p2•p1•p0•c0)

c4 = g3+(p3•g2)+(p3•p2•g1)+(p3•p2•p1•g0)+(p3•p2•p1•p0•c0)
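The four carry equations translate directly into code. Below is a minimal Python sketch of a 4-bit carry-lookahead adder built from g and p exactly as defined here; the function name `cla4` is made up for illustration.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead add of two 4-bit patterns; returns (sum, c4)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]   # generate:  g_i = a_i•b_i
    p = [x | y for x, y in zip(a_bits, b_bits)]   # propagate: p_i = a_i+b_i

    # Every carry is a two-level function of g, p and c0 (no rippling).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (a_bits[i] ^ b_bits[i] ^ carries[i]) << i   # sum bit i
    return total, c4
```

Note that c1 through c4 are all computed from the inputs alone, so in hardware they can be evaluated in parallel.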

Hierarchical Solution I (16 bit)

[Figure: 4-bit carry-lookahead ALUs connected in ripple fashion; ALU0 takes A0-A3, B0-B3 and c0 = carry_in and produces result 0-3; ALU1 takes A4-A7, B4-B7 and c4 = carry_in and produces result 4-7; …]

Delay = 4 * Delay (4-bit carry look-ahead ALU)

Hierarchical Solution II

Hierarchical solution I

- Group 32 bits into 8 4-bit groups

- Within each group, use carry look ahead

- Use the 4-bit group as a building block, and connect the groups in ripple-carry fashion

Hierarchical solution II

- Group 32 bits into 8 4-bit groups

- Within each group, use carry look ahead

- Another level of carry look ahead is used to connect these 4-bit groups

Hierarchical Solution II

[Figure: four 4-bit ALUs (A0B0-A3B3, A4B4-A7B7, A8B8-A11B11, A12B12-A15B15) producing result 0-3, result 4-7, result 8-11 and result 12-15; each group sends its P and G signals (P0-P3, G0-G3) to a carry-lookahead unit, which takes cin and produces the group carries C1, C2, C3 and cout]

Carry-lookahead unit

• input a0-a15, b0-b15

• calculate P0-P3, G0-G3

• calculate C1-C4

• each 4-bit ALU calculates its results

Fast Carry using the second level abstraction

P0 = p3•p2•p1•p0

P1 = p7•p6•p5•p4

P2 = p11•p10•p9•p8

P3 = p15•p14•p13•p12

G0 = g3+(p3•g2)+(p3•p2•g1)+(p3•p2•p1•g0)

G1 = g7+(p7•g6)+(p7•p6•g5)+(p7•p6•p5•g4)

G2 = g11+(p11•g10)+(p11•p10•g9)+(p11•p10•p9•g8)

G3 = g15+(p15•g14)+(p15•p14•g13)+(p15•p14•p13•g12)

C1 = G0+(P0•c0)

C2 = G1+(P1•G0)+(P1•P0•c0)

C3 = G2+(P2•G1)+(P2•P1•G0)+(P2•P1•P0•c0)

C4 = G3+(P3•G2)+(P3•P2•G1)+(P3•P2•P1•G0)+(P3•P2•P1•P0•c0)
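The second-level equations can be exercised with a 16-bit model. The Python sketch below uses hypothetical function names, with p_i = a_i + b_i and g_i = a_i•b_i as on the first-level slide. As a simplification, plain integer addition stands in for the inside of each 4-bit lookahead ALU; only the group P, G and C signals are computed from the equations.

```python
def group_pg(a4, b4):
    """Group propagate P and generate G for one 4-bit group."""
    g = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    p = [((a4 >> i) & 1) | ((b4 >> i) & 1) for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G


def cla16(a, b, c0=0):
    """16-bit add using second-level lookahead; returns (result, C4)."""
    a_groups = [(a >> (4 * j)) & 0xF for j in range(4)]
    b_groups = [(b >> (4 * j)) & 0xF for j in range(4)]
    pg = [group_pg(x, y) for x, y in zip(a_groups, b_groups)]
    P = [t[0] for t in pg]
    G = [t[1] for t in pg]

    # Group carries from the carry-lookahead unit.
    C1 = G[0] | (P[0] & c0)
    C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    C3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    C4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))

    group_carry_in = [c0, C1, C2, C3]
    result = 0
    for j in range(4):
        # Integer addition stands in for each 4-bit lookahead ALU.
        nibble = (a_groups[j] + b_groups[j] + group_carry_in[j]) & 0xF
        result |= nibble << (4 * j)
    return result, C4
```

The structure mirrors the bullet steps of the carry-lookahead unit: compute P and G per group, compute C1 to C4, then let each group produce its result.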

Shift Operations

Also need operations to pack and unpack 8-bit characters into 32-bit words

Shifts move all the bits in a word left or right

sll $t2, $s0, 8    # $t2 = $s0 << 8 bits

srl $t2, $s0, 8    # $t2 = $s0 >> 8 bits

Such shifts are logical because they fill with zeros

       op      rs     rt     rd     shamt  funct
sll    000000  00000  10000  01010  01000  000000
srl    000000  00000  10000  01010  01000  000010
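Packing and unpacking 8-bit characters with logical shifts can be sketched as follows. The Python helpers below are illustrative only: the names are made up, and placing the first character in the most significant byte is an arbitrary choice for the example.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit words


def sll(x, shamt):
    """Shift left logical: zeros enter on the right, high bits fall off."""
    return (x << shamt) & MASK32


def srl(x, shamt):
    """Shift right logical: zeros enter on the left."""
    return (x & MASK32) >> shamt


def pack4(chars):
    """Pack four 8-bit characters into one word, first character highest."""
    word = 0
    for ch in chars:
        word = sll(word, 8) | (ord(ch) & 0xFF)
    return word


def unpack4(word):
    """Recover the four characters by shifting right and masking a byte."""
    return "".join(chr(srl(word, s) & 0xFF) for s in (24, 16, 8, 0))


word = pack4("MIPS")
print(hex(word), unpack4(word))
```

In MIPS assembly the same pattern would pair sll or srl with or and andi to merge and mask bytes.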

Shift Operations, cont’d

An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left one bit should be 2 times its original value)

- so sra uses the most significant bit (sign bit) as the bit shifted in

- note that there is no need for a sla when using two’s complement number representation

sra $t2, $s0, 8    # $t2 = $s0 >> 8 bits

       op      rs     rt     rd     shamt  funct
sra    000000  00000  10000  01010  01000  000011

The shift operation is implemented by hardware (usually a barrel shifter) outside the ALU
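The difference between sra and srl shows up only for negative values. The Python sketch below models both on 32-bit patterns; the function names mirror the instructions but the code is an illustration, not a description of the barrel shifter.

```python
def srl(x, shamt, width=32):
    """Logical right shift: vacated high bits are filled with zeros."""
    mask = (1 << width) - 1
    return (x & mask) >> shamt


def sra(x, shamt, width=32):
    """Arithmetic right shift: vacated high bits copy the sign bit."""
    mask = (1 << width) - 1
    x &= mask
    sign = (x >> (width - 1)) & 1
    shifted = x >> shamt
    if sign:
        shifted |= mask & ~(mask >> shamt)   # set the vacated high bits to 1
    return shifted


minus_256 = (-256) & 0xFFFFFFFF              # bit pattern 0xFFFFFF00
print(hex(sra(minus_256, 8)), hex(srl(minus_256, 8)))
```

With sra the pattern for -256 shifted right by 8 stays negative, while srl turns it into a large positive number.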

Multiplication

More complicated than addition

- accomplished via shifting and addition

        0010    (multiplicand)
      x 1011    (multiplier)
      ------
        0010
       0010     (partial product
      0000       array)
     0010
    --------
    00010110    (product)

Double precision product produced

More time and more area to compute
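The shift-and-add procedure behind the 0010 x 1011 example can be written out directly. This is a minimal Python sketch for unsigned operands with an illustrative function name; it returns the partial products as well so they can be compared with the array.

```python
def shift_add_multiply(multiplicand, multiplier, width=4):
    """Unsigned multiply by shifting and adding.

    Returns (product, partial_products). The product can need up to
    2*width bits, i.e. it is double precision.
    """
    product = 0
    partials = []
    for i in range(width):
        bit = (multiplier >> i) & 1
        # Partial product i: the multiplicand shifted left i places, or 0.
        partial = (multiplicand << i) if bit else 0
        partials.append(partial)
        product += partial
    return product, partials


product, partials = shift_add_multiply(0b0010, 0b1011)
print(f"{product:08b}", [f"{p:08b}" for p in partials])
```

Each multiplier bit costs one conditional add and one shift, which is where the extra time and area come from.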

MIPS Multiply Instruction

mult $s0, $s1    # hi||lo = $s0 * $s1

        op      rs     rt     rd     shamt  funct
mult    000000  10000  10001  00000  00000  011000

Low-order word of the product is left in processor register lo and the high-order word is left in register hi

Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file
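How a 64-bit product splits across hi and lo can be sketched in Python. The function below is a hypothetical model of the signed mult instruction on 32-bit patterns; reading the two tuple elements plays the role of mfhi and mflo.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mult(rs, rt):
    """Signed 32 x 32 -> 64-bit multiply; returns (hi, lo) as 32-bit patterns."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64
    hi = product >> 32        # high-order word, what mfhi would read
    lo = product & MASK32     # low-order word, what mflo would read
    return hi, lo


hi, lo = mult(0x00010000, 0x00010000)   # 2^16 * 2^16 = 2^32
print(hex(hi), hex(lo))
```

The example product does not fit in 32 bits, so lo alone would be 0 and the information is entirely in hi.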

Review: MIPS ISA, so far

Category        Instr               Op Code    Example              Meaning

Arithmetic      add                 0 and 32   add $s1, $s2, $s3    $s1 = $s2 + $s3
(R & I format)  add unsigned        0 and 33   addu $s1, $s2, $s3   $s1 = $s2 + $s3
                subtract            0 and 34   sub $s1, $s2, $s3    $s1 = $s2 - $s3
                subt unsigned       0 and 35   subu $s1, $s2, $s3   $s1 = $s2 - $s3
                add immediate       8          addi $s1, $s2, 6     $s1 = $s2 + 6
                add imm. unsigned   9          addiu $s1, $s2, 6    $s1 = $s2 + 6
                multiply            0 and 24   mult $s1, $s2        hi || lo = $s1 * $s2
                multiply unsigned   0 and 25   multu $s1, $s2       hi || lo = $s1 * $s2
                divide              0 and 26   div $s1, $s2         lo = $s1/$s2, rem. in hi
                divide unsigned     0 and 27   divu $s1, $s2        lo = $s1/$s2, rem. in hi

Logical         and                 0 and 36   and $s1, $s2, $s3    $s1 = $s2 & $s3
(R & I format)  or                  0 and 37   or $s1, $s2, $s3     $s1 = $s2 | $s3
                xor                 0 and 38   xor $s1, $s2, $s3    $s1 = $s2 xor $s3
                nor                 0 and 39   nor $s1, $s2, $s3    $s1 = !($s2 | $s3)
                and immediate       12         andi $s1, $s2, 6     $s1 = $s2 & 6
                or immediate        13         ori $s1, $s2, 6      $s1 = $s2 | 6
                xor immediate       14         xori $s1, $s2, 6     $s1 = $s2 xor 6

Review: MIPS ISA, so far, cont’d

Category        Instr               Op Code    Example              Meaning

Shift           sll                 0 and 0    sll $s1, $s2, 4      $s1 = $s2 << 4
(R format)      srl                 0 and 2    srl $s1, $s2, 4      $s1 = $s2 >> 4
                sra                 0 and 3    sra $s1, $s2, 4      $s1 = $s2 >> 4

Data Transfer   load word           35         lw $s1, 24($s2)      $s1 = Memory($s2+24)
(I format)      store word          43         sw $s1, 24($s2)      Memory($s2+24) = $s1
                load byte           32         lb $s1, 25($s2)      $s1 = Memory($s2+25)
                load byte unsigned  36         lbu $s1, 25($s2)     $s1 = Memory($s2+25)
                store byte          40         sb $s1, 25($s2)      Memory($s2+25) = $s1
                load upper imm      15         lui $s1, 6           $s1 = 6 * 2^16
                move from hi        0 and 16   mfhi $s1             $s1 = hi
                move to hi          0 and 17   mthi $s1             hi = $s1
                move from lo        0 and 18   mflo $s1             $s1 = lo
                move to lo          0 and 19   mtlo $s1             lo = $s1

Review: MIPS ISA, so far, cont’d

Category        Instr                           Op Code    Example              Meaning

Cond. Branch    br on equal                     4          beq $s1, $s2, L      if ($s1==$s2) go to L
(I & R format)  br on not equal                 5          bne $s1, $s2, L      if ($s1 !=$s2) go to L
                set on less than                0 and 42   slt $s1, $s2, $s3    if ($s2<$s3) $s1=1 else $s1=0
                set on less than unsigned       0 and 43   sltu $s1, $s2, $s3   if ($s2<$s3) $s1=1 else $s1=0
                set on less than immediate      10         slti $s1, $s2, 6     if ($s2<6) $s1=1 else $s1=0
                set on less than imm. unsigned  11         sltiu $s1, $s2, 6    if ($s2<6) $s1=1 else $s1=0

Uncond. Jump    jump                            2          j 2500               go to 10000
(J & R format)  jump and link                   3          jal 2500             go to 10000; $ra=PC+4
                jump register                   0 and 8    jr $s1               go to $s1
                jump and link reg               0 and 9    jalr $s1, $s2        go to $s1, $s2=PC+4
