View
212
Download
0
Embed Size (px)
Citation preview
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
ECE 300
Advanced VLSI DesignFall 2006
Lecture 17: Datapath Design
& AddersYunsi Fei[Adapted from Jan Rabaey et al’s Digital Integrated
Circuits ©2002, PSU Irwin & Vijay © 2002, and Princeton Wayne Wolf’s Modern VLSI Design © 2002 ]
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Major Components of a Computer
Processor
Control
Datapath
Memory
Devices
Input
Output
Modern processor architecture styles– Pipelined, single issue (e.g., ARM)
– Pipelined, hardware controlled multiple issue – superscalar
– Pipelined, software controlled multiple issue – VLIW
– Pipelined, multiple issue from multiple process threads - multithreaded
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Basic Building Blocks
Datapath– Execution units
» Adder, multiplier, divider, shifter, etc.
– Register file and pipeline registers
– Multiplexers, decoders
Control– Finite state machines (PLA, ROM, random logic)
Interconnect– Switches, arbiters, buses
Memory– Caches, TLBs, DRAM, buffers
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
MIPS 5-Stage Pipelined (Single Issue) Datapath
ReadAddress
I$
Add
PC
4
0
1
Write Data
Read Addr 1
Read Addr 2
Write Addr
Register
File
Read Data 1
Read Data 2
SignExtend16 32
ALU
1
0
Shiftleft 2
Add
D$Address
Write Data
ReadData
1
0
IF/D
ec
De
c/E
xe
c
Ex
ec
/Me
m
Me
m/W
B
pipelinestage
isolationregister
Fetch Decode Execute Memory WriteBack
clk
Icacheprecharge
Dcacheprecharge
RegWrite
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Datapath Bit-Sliced Organization
Control Flow
Bit 0
Bit 1
Bit 2
Bit 3
Tile identical bit-slice elements
Re
gis
ter
File
Pip
elin
e R
egis
ter
Ad
der
Sh
ifter
Pip
elin
e R
egis
ter
Mu
ltip
lexe
r
Mu
ltip
lexe
r
Data Flow
Pip
elin
e R
egis
ter
From I$
Pip
elin
e R
egis
ter
To/From D$
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Adders
Carry-ripple Manchester carry chain Carry skip Carry select Carry look ahead
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
The 1-bit Binary Adder
1-bit Full Adder(FA)
A
B
S
Cin
S = A B Cin
Cout = A&B | A&Cin | B&Cin (majority function)
How can we use it to build a 64-bit adder?
How can we modify it easily to build an adder/subtractor?
How can we make it better (faster, lower power, smaller)?
A B Cin Cout S carry status
0 0 0 0 0 kill
0 0 1 0 1 kill
0 1 0 0 1 propagate
0 1 1 1 0 propagate
1 0 0 0 1 propagate
1 0 1 1 0 propagate
1 1 0 1 0 generate
1 1 1 1 1 generate
Cout
G = A&BP = A BK = !A & !B
= P Cin
= G | P&Cin
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Delay Balanced FA
B !B
Identical Delays for Carry and Sum
P !P
Signal set-up
B
A
!B
pA
Carry generation
Sum generation
Cin
!P
A
!Cout
!P
P
Cin
P
A
!Cout
P
!P
SCin Cin
20+2 transistors
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
A 64-bit Adder/Subtractor
1-bit FA S0
C0=Cin
C1
1-bit FA S1
C2
1-bit FA S2
C3
C64=Cout
1-bit FA S63
C63
. .
.
Ripple Carry Adder (RCA) built out of 64 FAs
Subtraction – complement all subtrahend bits (xor gates) and set the low order carry-in
RCA
advantage: simple logic, so small (low cost)
disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption)
A0
B0
A1
B1
A2
B2
A63
B63
add/subt
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Ripple Carry Adder (RCA)
A0 B0
S0
C0=CinFA
A1 B1
S1
FA
A2 B2
S2
FA
A3 B3
S3
FACout=C4
T = O(N) worst case delay
Tadder TFA(A,BCout) + (N-2)TFA(CinCout) + TFA(CinS)
Real Goal: Make the fastest possible carry path
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Inversion Property
A B
S
CinFA
!Cout (A, B, Cin) = Cout (!A, !B, !Cin)
Cout
A B
S
FACout Cin
!S (A, B, Cin) = S(!A, !B, !Cin)
Inverting all inputs to a FA results in inverted values for all outputs
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Exploiting the Inversion Property
A0 B0
S0
C0=CinFA’
A1 B1
S1
FA’
A2 B2
S2
FA’
A3 B3
S3
FA’Cout=C4
Now need two “flavors” of FAs
regular cellinverted cell Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Fast Carry Chain Design
The key to fast addition is a low latency carry network What matters is whether in a given position a carry is
– generated Gi = Ai & Bi = AiBi
– propagated Pi = Ai Bi (sometimes use Ai | Bi)
– annihilated (killed) Ki = !Ai & !Bi
Giving a carry recurrence of Ci+1 = Gi | PiCi
C1 = G0 | P0C0
C2 = G1 | P1G0 | P1P0 C0
C3 = G2 | P2G1 | P2P1G0 | P2P1P0 C0
C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0 C0
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Manchester Carry Chain
Switches controlled by Gi and Pi
Total delay of– time to form the switch control signals Gi and Pi
– setup time for the switches– signal propagation delay through N switches in the worst case
Gi Pi
!Ci!Ci+1
clk
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
4-bit Sliced MCC Adder
G P
!C0
clk
G PG PG P
& & & &
A0 B0A1 B1A2 B2A3 B3
S0S1S2S3
!C1!C2!C3
!C4
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Domino Manchester Carry Chain Circuit
Ci,0G0
clk
clkP0P1P2P3
G1G2G3
Ci,41 2 3 4
5
6
3 3 3 3 3
1
2
2
3
3
4
4
5
!(G0 | P0 Ci,0)
!(G1 | P1G0 | P1P0 Ci,0)
!(G2 | P2G1 | P2P1G0 | P2P1P0 Ci,0)
!(G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0 Ci,0)
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Binary Adder Landscapesynchronous word parallel adders
ripple carry adders (RCA) carry prop min adders
signed-digit fast carry prop residue adders adders adders
Manchester carry parallel conditional carry carry chain select prefix sum skip
T = O(N), A = O(N)
T = O(1), A = O(N)
T = O(log N)A = O(N log N)
T = O(N), A = O(N)T = O(N)
A = O(N)
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-Skip (Carry-Bypass) Adder
If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0 otherwise the block itself kills or generates the carry internally
A0 B0
S0
Ci,0FA
A1 B1
S1
FA
A2 B2
S2
FA
A3 B3
S3
FACo,3
Co,3
BP = P0 P1 P2 P3 “Block Propagate”
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-Skip Chain Implementation
BPblock carry-in
block carry-outcarry-out
Cin
G0
P0P1P2P3
G1G2G3
!Cout
BP
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
4-bit Block Carry-Skip Adder
Worst-case delay carry from bit 0 to bit 15 = carry generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), ripples in the last group from bit 12 to bit 15
Ci,0
Sum
CarryPropagation
Setup
Sum
CarryPropagation
Setup
Sum
CarryPropagation
Setup
Sum
CarryPropagation
Setup
bits 0 to 3bits 4 to 7bits 8 to 11bits 12 to 15
Tadd = tsetup + B tcarry + ((N/B) -1) tskip +B tcarry + tsum
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Optimal Block Size and Time
Assuming one stage of ripple (tcarry) has the same delay as one skip logic stage (tskip) and both are 1
TCSkA = 1 + B + (N/B-1) + B + 1
tsetup ripple in skips ripple in tsum
block 0 last block
= 2B + N/B + 1 So the optimal block size, B, is
dTCSkA/dB = 0 (N/2) = Bopt
And the optimal time is
Optimal TCSkA = 2((2N)) + 1
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-Skip Adder Extensions Variable block sizes
– A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks, so can have bigger blocks for the inner carries without increasing the overall delay
CinCout
Multiple levels of skip logic
skip level 1
skip level 2
CinCout
AND of the first level skip signals (BP’s)
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-Skip Adder Comparisons
0
10
20
30
40
50
60
70
8 bits 16 bits 32 bits 48 bits 64 bits
RCA
CSkA
VSkA
B=2 B=3B=4
B=5B=6
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Parallel Prefix Adders (PPAs)
Define carry operator € on (G,P) signal pairs
– € is associative, i.e.,
[(g’’’,p’’’) € (g’’,p’’)] € (g’,p’) = (g’’’,p’’’) € [(g’’,p’’) € (g’,p’)]
€
(G’’,P’’) (G’,P’)
(G,P)
where G = G’’ P’’G’ P = P’’P’
€
€ €
€
G’
!G
G’’
P’’
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
PPA General Structure Given P and G terms for each bit position, computing all the
carries is equal to finding all the prefixes in parallel
(G0,P0) € (G1,P1) € (G2,P2) € … € (GN-2,PN-2) € (GN-1,PN-1)
Since € is associative, we can group them in any order – but note that it is not commutative
Measures to consider– number of € cells
– tree cell depth (time)
– tree cell area
– cell fan-in and fan-out
– max wiring length
– wiring congestion
– delay path variation (glitching)
Pi, Gi logic (1 unit delay)
Si logic (1 unit delay)
Ci parallel prefix logic tree (1 unit delay per level)
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Brent-Kung PPAP
aral
lel P
refix
Com
puta
tion
€
G0
P0
G1
P1
G2
p2
G3
P3
G4
P4
G5
P5
G6
P6
G7
P7
G8
P8
G9
p9
G10
P10
G11
p11
G12
P12
G13
p13
G14
p14
G15
p15
€€€€€€€
€ € € €
€
€
€
€
€
€
€ € € € € €
€ €
C1C2C3C4C5C6C7C8C9C10C11C12C13C14C15C16
Cin
€
T =
log 2
NT
= lo
g 2N
- 2
A =
2lo
g 2N
A = N/2
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Kogge-Stone PPF AdderP
aral
lel P
refix
Com
puta
tion
€
G0
P0
G1
P1
G2
P2
G3
P3
G4
P4
G5
P5
G6
P6
G7
P7
G8
P8
G9
P9
G10
P10
G11
P11
G12
P12
G13
P13
G14
P14
G15
P15
€€€€€€€
€ € € €
€
€
€
€
C1C2C3C4C5C6C7C8C9C10C11C12C13C14C15C16
Cin
€
T =
log 2
N
A =
log 2
N
A = N
€€€€€€€
€ € € € € € € € € €
€ € € € € € € € € €
€ € € € € €
Tadd = tsetup + log2N t€ + tsum
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
More Adder Comparisons
0
10
20
30
40
50
60
70
8 bits 16 bits 32 bits 48 bits 64 bits
RCA
CSkA
VSkA
KS PPA
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Topics
Adders and ALUs (§6.4, §6.5)– Carry-ripple– Carry look ahead– Manchester carry chain– Carry skip– Carry select
Multipliers (§6.6) Subsystem design principles (§6.2)
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Adders
1-bit full adder– Si = ai bi ci
– ci+1 = aibi + aici + bici
Carry-ripple adder– n-bit adder built from full adders
Adder delay is dominated by carry chain– Naming: Carry- … adder
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
1-bit Full Adder: the Mirror Adder
VDD
Ci
A
BBA
B
A
A B
VDD
Ci
A B Ci
Ci
B
A
Ci
A
BBA
VDD
SCo
24 transistors
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-lookahead Adder
First compute carry propagate, generate:– Pi = ai + bi
– Gi = ai bi
Compute sum and carry from P and G:– Si = ci Pi Gi = ai bi ci
– ci+1 = Gi + Pici
= Gi + PiGi-1 + PiPi-1 Gi-2 + … +Pi …Pj cj
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Depth-4 Carry-lookahead
C1= G0 + P0Cin
C2= G1 + P1 G0 + P1P0Cin
C3= G2 + P2G1+P2P1 G0 + P2P1P0Cin
C4 = G3 + P3G2 + P3P2G1+ P3P2P1 G0 + P3P2P1P0Cin
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Analysis
Deepest carry expansion requires gates with large fanin: large, slow– Generally use 4-bit groups– Domino logic implementation
Carry look ahead tree– C4 = G3 + P3G2 + P3P2G1+ P3P2P1 G0 + P3P2P1P0Cin
» G* = G3 + P3G2 + P3P2G1+ P3P2P1 G0
» P* = P3P2P1P0
» C4 = G* + P*Cin
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Manchester Carry Chain Circuit
Gi-1
Pi-1
+
Gi
Pi
+
stage i-1 stage i
Ci+1Ci-1Ci
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Manchester Carry Chain
Precharged/evaluate carry chain Principles
– If Gi = aibi = 1, Pi = ai+bi = 0, Ci+1 = 1
– If Gi = aibi = 0, Pi = ai+bi = 0, Ci+1 = 0
– If Gi = aibi = 0, Pi = ai+bi = 1, Ci+1= Ci
Worst-case discharge path goes through entire carry chain.
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-skip Adder
For m-bit addition, its Cout can be– Inherited from Cin
» ai bi for every bit in stage
– Generated locally within m-bit» i.e. The Cout when Cin = 0
Optimum group size: m = sqrt(n/2)
Longest path:
– Similar to Manchester chain
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Two-bit Carry-skip Structure
ai
bi
ai+1
bi+1
Ci
ai+1 bi+1 + (ai+1+bi+1)aibi
Ci+2
or using a mux
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-skip Group Structure
M-bit FA
group
M-bit FA
group
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-select Adder
Computes two results in parallel, each for different carry input assumptions.
Uses actual carry in to select correct result. Reduces delay to multiplexer.
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Carry-select Structure
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
DEC “alpha” 21064 Adder
64-bit adder, 0.75m technology, 5ns delay
On the 8-bit level: Manchester chain On the 32-bit sub-block: Carry look ahead On the 64-bit block: Carry select
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Serial Adder
May be used in signal-processing arithmetic where fast computation is important but latency is unimportant.
Data format (LSB first):
bit 0bit 1bit 2bit 3...
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Serial adder Structure
LSB control signal clears the carry shift register:
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Subtraction
a – b = a + (-b) For an n-bit number b, how do we get its
complement?– (-b) = b + 1– a + (-b) = a + b + 1
» Using “1” as the carry-in to avoid two additions
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
ALUs
ALU computes a variety of logical and arithmetic functions based on opcode.– Shift
» Arithmetic/logical shift left, shift right
– Logic operations» AND, OR, NOT, …
– Add/subtract» Signed/unsigned, …
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
Opcodes
The control bits that determine the datapath– Whether it is a shift, add, subtract …
Must be carefully designed to ease decoding– Use decoder/de-multiplexer to select the correct
datapath
Digital Integrated Circuits 2e: Chapter 11.1-11.3 Copyright 2002 Prentice Hall PTR, Adapted by Yunsi Fei
An ALU Adder Structure