Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei ECE 300 Advanced VLSI Design Fall 2006 Lecture

Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei

ECE 300

Advanced VLSI DesignFall 2006

Lecture 17: Datapath Design

& AddersYunsi Fei[Adapted from Jan Rabaey et al’s Digital Integrated

Circuits ©2002, PSU Irwin & Vijay © 2002, and Princeton Wayne Wolf’s Modern VLSI Design © 2002 ]


Major Components of a Computer

Processor

Control

Datapath

Memory

Devices

Input

Output

Modern processor architecture styles– Pipelined, single issue (e.g., ARM)

– Pipelined, hardware controlled multiple issue – superscalar

– Pipelined, software controlled multiple issue – VLIW

– Pipelined, multiple issue from multiple process threads - multithreaded


Basic Building Blocks

Datapath– Execution units

» Adder, multiplier, divider, shifter, etc.

– Register file and pipeline registers

– Multiplexers, decoders

Control– Finite state machines (PLA, ROM, random logic)

Interconnect– Switches, arbiters, buses

Memory– Caches, TLBs, DRAM, buffers


MIPS 5-Stage Pipelined (Single Issue) Datapath

ReadAddress

I$

Add

PC

4

0

1

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read Data 1

Read Data 2

SignExtend16 32

ALU

1

0

Shiftleft 2

Add

D$Address

Write Data

ReadData

1

0

IF/D

ec

De

c/E

xe

c

Ex

ec

/Me

m

Me

m/W

B

pipelinestage

isolationregister

Fetch Decode Execute Memory WriteBack

clk

Icacheprecharge

Dcacheprecharge

RegWrite


Datapath Bit-Sliced Organization

Control Flow

Bit 0

Bit 1

Bit 2

Bit 3

Tile identical bit-slice elements

Re

gis

ter

File

Pip

elin

e R

egis

ter

Ad

der

Sh

ifter

Pip

elin

e R

egis

ter

Mu

ltip

lexe

r

Mu

ltip

lexe

r

Data Flow

Pip

elin

e R

egis

ter

From I$

Pip

elin

e R

egis

ter

To/From D$


Adders

Carry-ripple Manchester carry chain Carry skip Carry select Carry look ahead


The 1-bit Binary Adder

1-bit Full Adder(FA)

A

B

S

Cin

S = A B Cin

Cout = A&B | A&Cin | B&Cin (majority function)

How can we use it to build a 64-bit adder?

How can we modify it easily to build an adder/subtractor?

How can we make it better (faster, lower power, smaller)?

A B Cin Cout S carry status

0 0 0 0 0 kill

0 0 1 0 1 kill

0 1 0 0 1 propagate

0 1 1 1 0 propagate

1 0 0 0 1 propagate

1 0 1 1 0 propagate

1 1 0 1 0 generate

1 1 1 1 1 generate

Cout

G = A&BP = A BK = !A & !B

= P Cin

= G | P&Cin


Delay Balanced FA

B !B

Identical Delays for Carry and Sum

P !P

Signal set-up

B

A

!B

pA

Carry generation

Sum generation

Cin

!P

A

!Cout

!P

P

Cin

P

A

!Cout

P

!P

SCin Cin

20+2 transistors


A 64-bit Adder/Subtractor

1-bit FA S0

C0=Cin

C1

1-bit FA S1

C2

1-bit FA S2

C3

C64=Cout

1-bit FA S63

C63

. .

.

Ripple Carry Adder (RCA) built out of 64 FAs

Subtraction – complement all subtrahend bits (xor gates) and set the low order carry-in

RCA

advantage: simple logic, so small (low cost)

disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption)

A0

B0

A1

B1

A2

B2

A63

B63

add/subt


Ripple Carry Adder (RCA)

A0 B0

S0

C0=CinFA

A1 B1

S1

FA

A2 B2

S2

FA

A3 B3

S3

FACout=C4

T = O(N) worst case delay

Tadder TFA(A,BCout) + (N-2)TFA(CinCout) + TFA(CinS)

Real Goal: Make the fastest possible carry path


Inversion Property

A B

S

CinFA

!Cout (A, B, Cin) = Cout (!A, !B, !Cin)

Cout

A B

S

FACout Cin

!S (A, B, Cin) = S(!A, !B, !Cin)

Inverting all inputs to a FA results in inverted values for all outputs


Exploiting the Inversion Property

A0 B0

S0

C0=CinFA’

A1 B1

S1

FA’

A2 B2

S2

FA’

A3 B3

S3

FA’Cout=C4

Now need two “flavors” of FAs

regular cellinverted cell Minimizes the critical path (the carry chain) by eliminating inverters between the FAs


Fast Carry Chain Design

The key to fast addition is a low latency carry network What matters is whether in a given position a carry is

– generated Gi = Ai & Bi = AiBi

– propagated Pi = Ai Bi (sometimes use Ai | Bi)

– annihilated (killed) Ki = !Ai & !Bi

Giving a carry recurrence of Ci+1 = Gi | PiCi

C1 = G0 | P0C0

C2 = G1 | P1G0 | P1P0 C0

C3 = G2 | P2G1 | P2P1G0 | P2P1P0 C0

C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0 C0


Manchester Carry Chain

Switches controlled by Gi and Pi

Total delay of– time to form the switch control signals Gi and Pi

– setup time for the switches– signal propagation delay through N switches in the worst case

Gi Pi

!Ci!Ci+1

clk


4-bit Sliced MCC Adder

G P

!C0

clk

G PG PG P

& & & &

A0 B0A1 B1A2 B2A3 B3

S0S1S2S3

!C1!C2!C3

!C4


Domino Manchester Carry Chain Circuit

Ci,0G0

clk

clkP0P1P2P3

G1G2G3

Ci,41 2 3 4

5

6

3 3 3 3 3

1

2

2

3

3

4

4

5

!(G0 | P0 Ci,0)

!(G1 | P1G0 | P1P0 Ci,0)

!(G2 | P2G1 | P2P1G0 | P2P1P0 Ci,0)

!(G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0 Ci,0)


Binary Adder Landscapesynchronous word parallel adders

ripple carry adders (RCA) carry prop min adders

signed-digit fast carry prop residue adders adders adders

Manchester carry parallel conditional carry carry chain select prefix sum skip

T = O(N), A = O(N)

T = O(1), A = O(N)

T = O(log N)A = O(N log N)

T = O(N), A = O(N)T = O(N)

A = O(N)


Carry-Skip (Carry-Bypass) Adder

If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0 otherwise the block itself kills or generates the carry internally

A0 B0

S0

Ci,0FA

A1 B1

S1

FA

A2 B2

S2

FA

A3 B3

S3

FACo,3

Co,3

BP = P0 P1 P2 P3 “Block Propagate”


Carry-Skip Chain Implementation

BPblock carry-in

block carry-outcarry-out

Cin

G0

P0P1P2P3

G1G2G3

!Cout

BP


4-bit Block Carry-Skip Adder

Worst-case delay carry from bit 0 to bit 15 = carry generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), ripples in the last group from bit 12 to bit 15

Ci,0

Sum

CarryPropagation

Setup

Sum

CarryPropagation

Setup

Sum

CarryPropagation

Setup

Sum

CarryPropagation

Setup

bits 0 to 3bits 4 to 7bits 8 to 11bits 12 to 15

Tadd = tsetup + B tcarry + ((N/B) -1) tskip +B tcarry + tsum


Optimal Block Size and Time

Assuming one stage of ripple (tcarry) has the same delay as one skip logic stage (tskip) and both are 1

TCSkA = 1 + B + (N/B-1) + B + 1

tsetup ripple in skips ripple in tsum

block 0 last block

= 2B + N/B + 1 So the optimal block size, B, is

dTCSkA/dB = 0 (N/2) = Bopt

And the optimal time is

Optimal TCSkA = 2((2N)) + 1


Carry-Skip Adder Extensions Variable block sizes

– A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks, so can have bigger blocks for the inner carries without increasing the overall delay

CinCout

Multiple levels of skip logic

skip level 1

skip level 2

CinCout

AND of the first level skip signals (BP’s)


Carry-Skip Adder Comparisons

0

10

20

30

40

50

60

70

8 bits 16 bits 32 bits 48 bits 64 bits

RCA

CSkA

VSkA

B=2 B=3B=4

B=5B=6


Parallel Prefix Adders (PPAs)

Define carry operator € on (G,P) signal pairs

– € is associative, i.e.,

[(g’’’,p’’’) € (g’’,p’’)] € (g’,p’) = (g’’’,p’’’) € [(g’’,p’’) € (g’,p’)]

€

(G’’,P’’) (G’,P’)

(G,P)

where G = G’’ P’’G’ P = P’’P’

€

€ €

€

G’

!G

G’’

P’’


PPA General Structure Given P and G terms for each bit position, computing all the

carries is equal to finding all the prefixes in parallel

(G0,P0) € (G1,P1) € (G2,P2) € … € (GN-2,PN-2) € (GN-1,PN-1)

Since € is associative, we can group them in any order – but note that it is not commutative

Measures to consider– number of € cells

– tree cell depth (time)

– tree cell area

– cell fan-in and fan-out

– max wiring length

– wiring congestion

– delay path variation (glitching)

Pi, Gi logic (1 unit delay)

Si logic (1 unit delay)

Ci parallel prefix logic tree (1 unit delay per level)


Brent-Kung PPAP

aral

lel P

refix

Com

puta

tion

€

G0

P0

G1

P1

G2

p2

G3

P3

G4

P4

G5

P5

G6

P6

G7

P7

G8

P8

G9

p9

G10

P10

G11

p11

G12

P12

G13

p13

G14

p14

G15

p15

€€€€€€€

€ € € €

€

€

€

€

€

€

€ € € € € €

€ €

C1C2C3C4C5C6C7C8C9C10C11C12C13C14C15C16

Cin

€

T =

log 2

NT

= lo

g 2N

- 2

A =

2lo

g 2N

A = N/2


Kogge-Stone PPF AdderP

aral

lel P

refix

Com

puta

tion

€

G0

P0

G1

P1

G2

P2

G3

P3

G4

P4

G5

P5

G6

P6

G7

P7

G8

P8

G9

P9

G10

P10

G11

P11

G12

P12

G13

P13

G14

P14

G15

P15

€€€€€€€

€ € € €

€

€

€

€

C1C2C3C4C5C6C7C8C9C10C11C12C13C14C15C16

Cin

€

T =

log 2

N

A =

log 2

N

A = N

€€€€€€€

€ € € € € € € € € €

€ € € € € € € € € €

€ € € € € €

Tadd = tsetup + log2N t€ + tsum


More Adder Comparisons

0

10

20

30

40

50

60

70

8 bits 16 bits 32 bits 48 bits 64 bits

RCA

CSkA

VSkA

KS PPA


Topics

Adders and ALUs (§6.4, §6.5)– Carry-ripple– Carry look ahead– Manchester carry chain– Carry skip– Carry select

Multipliers (§6.6) Subsystem design principles (§6.2)


Adders

1-bit full adder– Si = ai bi ci

– ci+1 = aibi + aici + bici

Carry-ripple adder– n-bit adder built from full adders

Adder delay is dominated by carry chain– Naming: Carry- … adder


1-bit Full Adder: the Mirror Adder

VDD

Ci

A

BBA

B

A

A B

VDD

Ci

A B Ci

Ci

B

A

Ci

A

BBA

VDD

SCo

24 transistors


Carry-lookahead Adder

First compute carry propagate, generate:– Pi = ai + bi

– Gi = ai bi

Compute sum and carry from P and G:– Si = ci Pi Gi = ai bi ci

– ci+1 = Gi + Pici

= Gi + PiGi-1 + PiPi-1 Gi-2 + … +Pi …Pj cj


Depth-4 Carry-lookahead

C1= G0 + P0Cin

C2= G1 + P1 G0 + P1P0Cin

C3= G2 + P2G1+P2P1 G0 + P2P1P0Cin

C4 = G3 + P3G2 + P3P2G1+ P3P2P1 G0 + P3P2P1P0Cin


Analysis

Deepest carry expansion requires gates with large fanin: large, slow– Generally use 4-bit groups– Domino logic implementation

Carry look ahead tree– C4 = G3 + P3G2 + P3P2G1+ P3P2P1 G0 + P3P2P1P0Cin

» G* = G3 + P3G2 + P3P2G1+ P3P2P1 G0

» P* = P3P2P1P0

» C4 = G* + P*Cin


Manchester Carry Chain Circuit

Gi-1

Pi-1

+

Gi

Pi

+

stage i-1 stage i

Ci+1Ci-1Ci


Manchester Carry Chain

Precharged/evaluate carry chain Principles

– If Gi = aibi = 1, Pi = ai+bi = 0, Ci+1 = 1

– If Gi = aibi = 0, Pi = ai+bi = 0, Ci+1 = 0

– If Gi = aibi = 0, Pi = ai+bi = 1, Ci+1= Ci

Worst-case discharge path goes through entire carry chain.


Carry-skip Adder

For m-bit addition, its Cout can be– Inherited from Cin

» ai bi for every bit in stage

– Generated locally within m-bit» i.e. The Cout when Cin = 0

Optimum group size: m = sqrt(n/2)

Longest path:

– Similar to Manchester chain


Two-bit Carry-skip Structure

ai

bi

ai+1

bi+1

Ci

ai+1 bi+1 + (ai+1+bi+1)aibi

Ci+2

or using a mux


Carry-skip Group Structure

M-bit FA

group

M-bit FA

group


Carry-select Adder

Computes two results in parallel, each for different carry input assumptions.

Uses actual carry in to select correct result. Reduces delay to multiplexer.


Carry-select Structure


DEC “alpha” 21064 Adder

64-bit adder, 0.75m technology, 5ns delay

On the 8-bit level: Manchester chain On the 32-bit sub-block: Carry look ahead On the 64-bit block: Carry select


Serial Adder

May be used in signal-processing arithmetic where fast computation is important but latency is unimportant.

Data format (LSB first):

bit 0bit 1bit 2bit 3...


Serial adder Structure

LSB control signal clears the carry shift register:


Subtraction

a – b = a + (-b) For an n-bit number b, how do we get its

complement?– (-b) = b + 1– a + (-b) = a + b + 1

» Using “1” as the carry-in to avoid two additions


ALUs

ALU computes a variety of logical and arithmetic functions based on opcode.– Shift

» Arithmetic/logical shift left, shift right

– Logic operations» AND, OR, NOT, …

– Add/subtract» Signed/unsigned, …


Opcodes

The control bits that determine the datapath– Whether it is a shift, add, subtract …

Must be carefully designed to ease decoding– Use decoder/de-multiplexer to select the correct

datapath


An ALU Adder Structure

Documents

Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei ECE 300 Advanced VLSI Design Fall 2006 Lecture