CSE241 R2 Datapath/Memory.1 Kahng & Cichy, UCSD ©2003
CSE241A VLSI Digital Circuits
Winter 2003
Recitation 02: Datapath and Memory
Introduction: Basic Building Blocks
Datapath
- Execution units: adder, multiplier, divider, shifter, etc.
- Register file and pipeline registers
- Multiplexers, decoders
Control
- Finite state machines (PLA, ROM, random logic)
Interconnect
- Switches, arbiters, buses (not covered)
Memory
- Caches (SRAMs), TLBs, DRAMs, buffers
The 1-bit Binary Adder
1-bit Full Adder (FA): inputs A, B, Cin; outputs S, Cout
S = A ⊕ B ⊕ Cin
Cout = A&B | A&Cin | B&Cin (majority function)
How can we use it to build a 64-bit adder?
How can we modify it easily to build an adder/subtractor?
How can we make it better (faster, lower power, smaller)?
A B Cin Cout S carry status
0 0 0 0 0 kill
0 0 1 0 1 kill
0 1 0 0 1 propagate
0 1 1 1 0 propagate
1 0 0 0 1 propagate
1 0 1 1 0 propagate
1 1 0 1 0 generate
1 1 1 1 1 generate
G = A&B
P = A ⊕ B
K = !A & !B
S = P ⊕ Cin
Cout = G | P&Cin
Slide courtesy of Mary Jane Irwin, Penn state
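The truth table and the generate/propagate/kill formulation can be checked against each other with a short script. This is a minimal sketch; the function names are illustrative and not part of the slides.

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder: S = A xor B xor Cin, Cout = majority(A, B, Cin)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def full_adder_gp(a, b, cin):
    """Same adder written with generate (G) and propagate (P) signals."""
    g = a & b
    p = a ^ b
    return p ^ cin, g | (p & cin)

def carry_status(a, b):
    """Classify a bit position as in the 'carry status' column."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

# Print the truth table and confirm both formulations agree on every row.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert (s, cout) == full_adder_gp(a, b, cin)
    print(a, b, cin, cout, s, carry_status(a, b))
```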
FA Gate Level Implementations
[Figure: two gate-level full adder schematics, each with inputs A, B, Cin, outputs S, Cout, and internal nodes t0, t1, t2]
Slide courtesy of Mary Jane Irwin, Penn state
Review: XOR FA
[Figure: XOR-based full adder schematic with inputs A, B, Cin and outputs S, Cout]
16 transistors
Slide courtesy of Mary Jane Irwin, Penn state
Ripple Carry Adder (RCA)
[Figure: 4-bit ripple carry adder: four FA cells in a chain; cell i takes Ai, Bi and Ci and produces Si and Ci+1, with C0 = Cin and Cout = C4]
T = O(N) worst case delay
Tadder ≈ TFA(A,B→Cout) + (N-2)TFA(Cin→Cout) + TFA(Cin→S)
Max delay = tdelay = tsum + (N-1) tcarry
Real Goal: Make the fastest possible carry path
Slide courtesy of Mary Jane Irwin, Penn state
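A behavioral sketch of the ripple carry adder follows. It also shows one common answer to the earlier adder/subtractor question (invert B and set Cin = 1); that is a standard technique, not something these slides spell out. The function names and the delay values in the usage lines are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)

def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit unsigned values by chaining n full adders.

    Returns (sum modulo 2**n, carry_out).
    """
    carry = cin
    total = 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry

def ripple_carry_sub(a, b, n):
    """Compute a - b as a + (~b) + 1 using the same adder chain."""
    mask = (1 << n) - 1
    return ripple_carry_add(a, ~b & mask, n, cin=1)

def worst_case_delay(n, t_ab_cout, t_cin_cout, t_cin_s):
    """Delay estimate from the slide: first carry, N-2 carry hops, final sum."""
    return t_ab_cout + (n - 2) * t_cin_cout + t_cin_s

print(ripple_carry_add(5, 6, 4))        # (11, 0)
print(ripple_carry_add(9, 7, 4))        # (0, 1): 16 overflows four bits
print(ripple_carry_sub(9, 7, 4))        # (2, 1)
print(worst_case_delay(4, 3, 2, 2))     # 9, in arbitrary delay units
```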
Inversion Property
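[Figure: a full adder with inputs A, B, Cin and outputs S, Cout, shown next to the same full adder with all inputs and outputs inverted]
!S (A, B, Cin) = S(!A, !B, !Cin)
!Cout (A, B, Cin) = Cout (!A, !B, !Cin)
Inverting all inputs to a FA results in inverted values for all outputs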
Slide courtesy of Mary Jane Irwin, Penn state
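The inversion property can be confirmed exhaustively over the eight input combinations. A minimal sketch, with illustrative function names:

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)

def inversion_property_holds():
    """Check that inverting all inputs inverts both outputs."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if s_inv != 1 - s or cout_inv != 1 - cout:
            return False
    return True

print(inversion_property_holds())
```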
Exploiting the Inversion Property
[Figure: 4-bit ripple carry adder built from FA’ cells (bits 0 to 3, C0 = Cin, Cout = C4); a legend distinguishes regular cells from inverted cells]
Now need two “flavors” of FAs: regular cell and inverted cell
• Minimizes the critical path (the carry chain) by eliminating inverters between the FAs (will need to increase the transistor sizing on the carry chain portion of the mirror adder).
Slide courtesy of Mary Jane Irwin, Penn state
Fast Carry Chain Design
The key to fast addition is a low latency carry network
What matters is whether in a given position a carry is
- generated: Gi = Ai & Bi = AiBi
- propagated: Pi = Ai ⊕ Bi (sometimes use Ai | Bi)
- annihilated (killed): Ki = !Ai & !Bi
Giving a carry recurrence of
Ci+1 = Gi | PiCi
C1 = G0 | P0C0
C2 = G1 | P1G0 | P1P0 C0
C3 = G2 | P2G1 | P2P1G0 | P2P1P0 C0
C4 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0 C0
Slide courtesy of Mary Jane Irwin, Penn state
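The expanded carry equations can be checked against the recurrence Ci+1 = Gi | PiCi for every 4-bit input. This is a sketch; bit lists are LSB first and the function names are illustrative.

```python
from itertools import product

def lookahead_carries(a, b, c0):
    """Return [C1, C2, C3, C4] from the flattened two-level equations."""
    g = [x & y for x, y in zip(a, b)]
    p = [x ^ y for x, y in zip(a, b)]
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

def recurrence_carries(a, b, c0):
    """Return [C1, C2, C3, C4] by rippling Ci+1 = Gi | Pi & Ci."""
    carries = []
    c = c0
    for x, y in zip(a, b):
        c = (x & y) | ((x ^ y) & c)
        carries.append(c)
    return carries

def equations_agree():
    """Compare both forms over all 2**9 combinations of A, B and C0."""
    for bits in product((0, 1), repeat=9):
        a, b, c0 = list(bits[0:4]), list(bits[4:8]), bits[8]
        if lookahead_carries(a, b, c0) != recurrence_carries(a, b, c0):
            return False
    return True

print(equations_agree())
```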
Binary Adder Landscape
synchronous word parallel adders
- ripple carry adders (RCA)
- carry prop min adders
  - signed-digit adders
  - fast carry prop adders: Manchester carry chain, carry select, parallel prefix, conditional sum, carry skip
  - residue adders
Complexity annotations in the figure (T = time, A = area):
- T = O(N), A = O(N)
- T = O(1), A = O(N)
- T = O(log N), A = O(N log N)
- T = O(N), A = O(N)
- T = O(N), A = O(N)
Parallel Prefix Adders (PPAs)
Define carry operator € on (G,P) signal pairs:
(G,P) = (G’’,P’’) € (G’,P’)
where G = G’’ | P’’G’ and P = P’’P’
€ is associative, i.e.,
[(g’’’,p’’’) € (g’’,p’’)] € (g’,p’) = (g’’’,p’’’) € [(g’’,p’’) € (g’,p’)]
[Figure: a € cell combining (G’’,P’’) and (G’,P’) into (G,P), with its gate-level implementation]
Slide courtesy of Mary Jane Irwin, Penn state
PPA General Structure
Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel
(G0,P0) € (G1,P1) € (G2,P2) € … € (GN-2,PN-2) € (GN-1,PN-1)
Since € is associative, we can group them in any order, but note that it is not commutative
Measures to consider
- number of € cells
- tree cell depth (time)
- tree cell area
- cell fan-in and fan-out
- max wiring length
- wiring congestion
- delay path variation (glitching)
Structure
- Pi, Gi logic (1 unit delay)
- Ci parallel prefix logic tree (1 unit delay per level)
- Si logic (1 unit delay)
Slide courtesy of Mary Jane Irwin, Penn state
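The three-stage structure can be sketched behaviorally. The prefix tree below uses a Kogge-Stone style scan, chosen only for illustration since the slides do not commit to one tree; treating the carry-in as an extra (G,P) = (Cin, 0) position is also an assumption of this sketch.

```python
def carry_op(hi, lo):
    """Carry operator on (G, P) pairs; hi is the more significant group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo

def prefix_add(a, b, n, cin=0):
    """Add two n-bit values; return (sum mod 2**n, carry_out, tree levels)."""
    # Stage 1: per-bit generate and propagate signals.
    gp = []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gp.append((ai & bi, ai ^ bi))
    p_bits = [p for _, p in gp]
    # Stage 2: prefix tree. Element k ends up covering positions 0..k,
    # so its G term is the carry into bit k (element 0 is the carry-in).
    pref = [(cin, 0)] + gp
    dist, levels = 1, 0
    while dist < len(pref):
        pref = [carry_op(pref[i], pref[i - dist]) if i >= dist else pref[i]
                for i in range(len(pref))]
        dist *= 2
        levels += 1
    carries = [g for g, _ in pref]
    # Stage 3: sum bits Si = Pi xor Ci.
    total = 0
    for i in range(n):
        total |= (p_bits[i] ^ carries[i]) << i
    return total, carries[n], levels

print(prefix_add(200, 100, 8))   # (44, 1, 4): 300 = 256 + 44
# The operator is associative but not commutative:
print(carry_op((0, 0), (1, 1)), carry_op((1, 1), (0, 0)))   # (0, 0) (1, 0)
```

The number of levels grows as log2 of the number of positions, which matches the T = O(log N) entry for parallel prefix adders.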
Adder Types
RCA = Ripple Carry
MCC = Manchester Carry Chain
CCSka = Carry-Chain haSave
VCSka =
CCSia = Carry-Chain Save with Invert
BK = Brent Kung
Others: (array type) Ling-Ling ELM Kogge-Stone
Review: Basic Building Blocks
Datapath
- Execution units: adder, multiplier, divider, shifter
- Register file and pipeline registers
- Multiplexers, decoders
Control
- Finite state machines (PLA, ROM, random logic)
Memory
- SRAM cell
- DRAM
- Other types
Parallel Programmable Shifters
[Figure: shifter block with Data In, Data Out, and Control inputs]
Control =
- Shift amount
- Shift direction
- Shift type (logical, arith, circular)
Shifters used in multipliers, floating point units
Consume lots of area if done in random logic gates
Slide courtesy of Mary Jane Irwin, Penn state
Shifters - Applications
Linear shifting
- Concatenate 2 words (N bits each) and pull out a contiguous N-bit word.
- Take a portion of a word and shift it to the left or right
  - Multiply by 2^M
  - Pad the emptied positions with 0’s or 1’s
- Arithmetic shifts
  - Left shift, pad 0’s
  - Right shift, pad with copies of the sign bit (1’s for a negative number)
Barrel shifting
- Emptied positions are filled with the bits dropped off.
- Rotational shifting, e.g. for circular convolution.
[Figure: wordA and wordB concatenated; wordC is an N-bit window taken from the pair]
Slide courtesy of Ken Yang, UCLA
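The shift types listed above can be modeled on N-bit words as follows. This is a behavioral sketch with illustrative function names; `funnel` models the "concatenate two words and pull out N bits" view, under the assumption that wordA is the upper half.

```python
def logical_left(x, m, n):
    """Shift left by m, padding with 0's; the result is kept to n bits."""
    return (x << m) & ((1 << n) - 1)

def logical_right(x, m, n):
    """Shift right by m, padding with 0's."""
    return (x & ((1 << n) - 1)) >> m

def arithmetic_right(x, m, n):
    """Shift right by m, padding with copies of the sign bit."""
    mask = (1 << n) - 1
    m = min(m, n)
    shifted = (x & mask) >> m
    if (x >> (n - 1)) & 1:
        shifted |= (mask << (n - m)) & mask
    return shifted

def rotate_right(x, m, n):
    """Barrel (circular) shift: bits dropped on the right re-enter on the left."""
    mask = (1 << n) - 1
    x &= mask
    m %= n
    return ((x >> m) | (x << (n - m))) & mask

def funnel(word_a, word_b, offset, n):
    """Concatenate wordA:wordB and extract the n-bit window starting at offset."""
    combined = (word_a << n) | word_b
    return (combined >> offset) & ((1 << n) - 1)

print(bin(logical_left(0b00010011, 2, 8)))      # 0b1001100
print(bin(arithmetic_right(0b10010000, 2, 8)))  # 0b11100100
print(bin(rotate_right(0b00000011, 1, 8)))      # 0b10000001
print(hex(funnel(0xAB, 0xCD, 4, 8)))            # 0xbc
```

Feeding the same word into both halves of `funnel` gives a rotation, which is one way a single structure can serve both linear and circular shifts.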
A Programmable Binary Shifter
[Figure: programmable shifter cell with control lines rgt, nop, left, inputs Ai, Ai-1, and outputs Bi, Bi-1]
Ai  Ai-1  rgt nop left  Bi  Bi-1
A1  A0    0   1   0     A1  A0
A1  A0    1   0   0     0   A1
A1  A0    0   0   1     A0  0
Slide courtesy of Mary Jane Irwin, Penn state
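The control table for the programmable shifter can be expressed as a small function. This is a behavioral sketch assuming one-hot control and zero fill at the ends, as the table suggests; the function name is illustrative.

```python
def shift_row(a, rgt, nop, left):
    """Apply a one-position programmable shift to a list of bits (LSB first).

    Exactly one of rgt, nop, left must be 1. Vacated positions are filled with 0.
    """
    if (rgt, nop, left).count(1) != 1:
        raise ValueError("exactly one of rgt, nop, left must be 1")
    n = len(a)
    if nop:
        return list(a)
    if rgt:
        # Right shift: B(i-1) = A(i), top bit becomes 0.
        return [a[i + 1] if i + 1 < n else 0 for i in range(n)]
    # Left shift: B(i) = A(i-1), bottom bit becomes 0.
    return [a[i - 1] if i >= 1 else 0 for i in range(n)]

# Symbolic inputs make the routing visible; lists are [B0, B1].
print(shift_row(["A0", "A1"], 0, 1, 0))   # ['A0', 'A1']
print(shift_row(["A0", "A1"], 1, 0, 0))   # ['A1', 0]
print(shift_row(["A0", "A1"], 0, 0, 1))   # [0, 'A0']
```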
4-bit Barrel Shifter
[Figure: 4-bit barrel shifter array with inputs A0 to A3, outputs B0 to B3, and shift control lines Sh0 to Sh3]
Example:
Sh0 = 1: B3B2B1B0 = A3A2A1A0
Sh1 = 1: B3B2B1B0 = A3A3A2A1
Sh2 = 1: B3B2B1B0 = A3A3A3A2
Sh3 = 1: B3B2B1B0 = A3A3A3A3
Area dominated by wiring
Slide courtesy of Mary Jane Irwin, Penn state
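The example rows correspond to a right shift that replicates A3 into the vacated positions. A minimal behavioral sketch, assuming one-hot Sh lines (the function name is illustrative):

```python
def barrel_shift(a_bits, sh):
    """4-bit barrel shifter model.

    a_bits = [A0, A1, A2, A3]; sh = [Sh0, Sh1, Sh2, Sh3] with exactly one line set.
    Returns [B0, B1, B2, B3]; A3 fills the positions vacated by the shift.
    """
    if sum(sh) != 1:
        raise ValueError("exactly one Sh line must be active")
    k = sh.index(1)
    return [a_bits[min(i + k, 3)] for i in range(4)]

a = ["A0", "A1", "A2", "A3"]
for k in range(4):
    sh = [1 if i == k else 0 for i in range(4)]
    b = barrel_shift(a, sh)
    # Print in the slide's order, most significant output first.
    print(f"Sh{k} = 1: B3B2B1B0 =", "".join(reversed(b)))
```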
4-bit Barrel Shifter Layout
[Figure: 4-bit barrel shifter layout with rows A0 to A3, control columns Sh0 to Sh3, an output buffer, and the dimension Widthbarrel marked]
Widthbarrel ~ 2 pm N, where N = max shift distance and pm = metal pitch
Delay ~ 1 fet + N diff caps
Only one Sh# active at a time
Slide courtesy of Mary Jane Irwin, Penn state
Review: Basic Building Blocks
Datapath
- Execution units: adder, multiplier, divider, shifter, etc.
- Register file and pipeline registers
Memories
- SRAM cell
- DRAM
- Other types
Multiplication
Binary multiplication
- Same with 2’s complement
- Sign-extend the negative.
2’s complement N-bit numbers
- Rhombus of N partial products
- Product has 2N number of bits.
Negative multiplier
- Last term is equivalent to 2’s complement.
- Multiplicand*(2^0+2^1+2^2+…) = Multiplicand*(1111…) = -1*Multiplicand = 01101
Sign extension is tricky
- Drop 1’s into sign bit if 0’s
- Otherwise invert sign bit.
Example: Multiplicand(B) = -13, Multiplier(A) = -15, product = 195 (nine bits + 1 sign)
  10011
x 10001
---------
Partial products:
111110011
00000
00000
00000
10011
10011
10011
10011
10011
----------
011000011
Slide courtesy of Ken Yang, UCLA
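The -13 × -15 example can be reproduced by sign-extending both operands to 2N bits and summing shifted partial products modulo 2^(2N). This is one straightforward way to model the sign extension, not necessarily the exact scheme drawn on the slide; the function names are illustrative.

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a 2's complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def sign_extend(x, from_bits, to_bits):
    """Widen a 2's complement bit pattern by replicating its sign bit."""
    return to_signed(x, from_bits) & ((1 << to_bits) - 1)

def multiply_twos_complement(a, b, n):
    """Multiply two n-bit 2's complement patterns; return the 2n-bit product pattern.

    a is the multiplier, b the multiplicand. Each set bit of the sign-extended
    multiplier contributes one shifted copy of the sign-extended multiplicand.
    """
    width = 2 * n
    mask = (1 << width) - 1
    a_ext = sign_extend(a, n, width)
    b_ext = sign_extend(b, n, width)
    product = 0
    for i in range(width):
        if (a_ext >> i) & 1:
            product = (product + (b_ext << i)) & mask
    return product

p = multiply_twos_complement(0b10001, 0b10011, 5)
print(format(p, "010b"), to_signed(p, 10))   # 0011000011 195
```

The sign-extension 1's of the multiplier each add another shifted copy of the multiplicand, which is why the slide lists 10011 five times.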
Parallel Multipliers
Each partial product is independent.
Multiply with 2 steps.
- First step: generate partial products in parallel.
- Second step: add the partial products.
Generating the Partial Products
- PPI,J = AI AND BJ
- Sign bit is a little different.
  - SI,N = B(sign)’ NAND A(sign)
[Figure: AND-gate array with rows A0, A1, A2 and columns B0 to BN-1, producing PP00, PP01, PP02, PP10, PP11, PP12]
Slide courtesy of Ken Yang, UCLA
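The two steps can be sketched for the unsigned case, where every partial-product bit is a plain AND. The special handling of the sign bit mentioned above is deliberately left out of this sketch, and the function names are illustrative.

```python
def partial_products(a, b, n):
    """Step 1: pp[i][j] = A_i AND B_j for unsigned n-bit operands."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for j in range(n)]
            for i in range(n)]

def sum_partial_products(pp):
    """Step 2: add the partial products, weighting pp[i][j] by 2**(i + j)."""
    return sum(bit << (i + j)
               for i, row in enumerate(pp)
               for j, bit in enumerate(row))

pp = partial_products(13, 11, 4)
for row in pp:
    print(row)
print(sum_partial_products(pp))   # 143
```

Because each entry depends only on one bit of each operand, all n*n AND gates can evaluate at the same time; only the addition step has carry dependencies.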
Review: Basic Building Blocks
Datapath
- Execution units: adder, multiplier, divider, shifter, etc.
- Register file and pipeline registers
Memories
- SRAM cell: 6T
- DRAM: 1T
- Other types: 1T SRAM
Semiconductor Memories
RWM (Read Write Memory)
- Random Access: SRAM (cache, register file), DRAM
- Non-Random Access: FIFO/LIFO, Shift Register, CAM
NVRWM (Non Volatile Read Write Memory)
- EPROM, E2PROM, FLASH
ROM (Read Only)
- Mask-programmed
- Electrically-programmed (PROM)
Slide courtesy of Mary Jane Irwin, Penn state
A Typical Memory Hierarchy
[Figure: memory hierarchy blocks: On-Chip Components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache), eDRAM, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk)]
Speed (ns):   .1’s    1’s    10’s    100’s   1,000’s
Size (bytes): 100’s   K’s    10K’s   M’s     T’s
Cost:         highest to lowest
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Slide courtesy of Mary Jane Irwin, Penn state
Access Time comparison
Type         Time (ns)
RDRAM        30
SDRAM        20
SRAM         10
FLASH        80 (.15u)
FRAM         10
ROM (read)   50
(Generalized ~.13u)
Latency: time to read
Bandwidth: throughput of system
Embedded RAM
SRAMs and DRAMs
SRAM
- 6-T / 4-T memory cell
- Low power (an important requirement for system on chip)
- Fast access cycles
- Relative transistor sizes determine noise margin
DRAM
- Capacitor based storage
- High density
- Refresh cycles required, hence high power
- Slower data access
- Capacitor size determines noise margin
Noise margin
- Important figure of merit
- Degraded with scaling
Read-Write Memories (RAMs)
Static – SRAM
- data is stored as long as supply is applied
- large cells (6 fets/cell), so fewer bits/chip
- fast, so used where speed is important (e.g., caches)
- differential outputs (output BL and !BL)
- use sense amps for performance
- compatible with CMOS technology
Dynamic – DRAM
- periodic refresh required
- small cells (1 to 3 fets/cell), so more bits/chip
- slower, so used for main memories
- single ended output (output BL only)
- need sense amps for correct operation
- not typically compatible with CMOS technology
Slide courtesy of Mary Jane Irwin, Penn state
6-transistor SRAM Cell
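[Figure: 6-transistor SRAM cell: transistors M1 to M4 form the storage latch holding Q and !Q; access transistors M5 and M6, gated by the word line WL, connect the cell to the bit lines !BL and BL]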
Slide courtesy of Mary Jane Irwin, Penn state
SRAM Cell Analysis (Read)
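[Figure: SRAM cell during a read, with !BL = 1 and BL = 1, WL = 1, stored state Q = 1 and !Q = 0, bit-line capacitances Cbit, and transistors M1, M4, M5, M6 shown]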
Read-disturb (read-upset): must carefully limit the allowed voltage rise on !Q to a value that prevents the read-upset condition from occurring while simultaneously maintaining acceptable circuit speed and area constraints
Slide courtesy of Mary Jane Irwin, Penn state
SRAM Cell Analysis (Read)
[Figure: SRAM cell during a read, with !BL = 1 and BL = 1, WL = 1, Q = 1 and !Q = 0, and bit-line capacitances Cbit]
Cell Ratio (CR) = (WM1/LM1)/(WM5/LM5)
V!Q = [(Vdd - VTn)(1 + CR ± sqrt(CR(1 + CR)))]/(1 + CR)
Slide courtesy of Mary Jane Irwin, Penn state
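A numeric sketch of the read-voltage expression follows. It assumes the expression is the quadratic solution V!Q = (Vdd - VTn)(1 + CR ± sqrt(CR(1 + CR)))/(1 + CR) and that the smaller (minus) root is the physically meaningful one, since the other root exceeds Vdd - VTn. The default values Vdd = 2.5 V and VTn = 0.5 V are the ones quoted for the plot that follows; the function name is illustrative.

```python
import math

def read_voltage_rise(cr, vdd=2.5, vtn=0.5):
    """Voltage rise on !Q during a read, as a function of the cell ratio CR.

    Assumption: the minus root of the quadratic is the physical solution.
    """
    return (vdd - vtn) * (1 + cr - math.sqrt(cr * (1 + cr))) / (1 + cr)

for cr in (0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4):
    print(f"CR = {cr:.1f}  V!Q = {read_voltage_rise(cr):.3f} V")
```

Under these assumptions the rise falls from roughly 1.04 V at CR = 0.3 to roughly 0.32 V at CR = 2.4, so a larger cell ratio keeps !Q further from the point where the cell could flip.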
Read Voltages Ratios
[Figure: plot of voltage rise on !Q (0 to 1.2 V) versus Cell Ratio (CR) from 0.3 to 2.4, for Vdd = 2.5V and VTn = 0.5V]
Slide courtesy of Mary Jane Irwin, Penn state
SRAM Cell Analysis (Write)
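[Figure: SRAM cell during a write, with !BL = 1 and BL = 0, WL = 1, initial state Q = 1 and !Q = 0, and transistors M1, M4, M5, M6 shown]
Pullup Ratio (PR) = (WM4/LM4)/(WM6/LM6)
VQ = (Vdd - VTn) ± sqrt((Vdd - VTn)^2 - (μp/μn)(PR)(Vdd - VTn - |VTp|)^2)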
Slide courtesy of Mary Jane Irwin, Penn state
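A numeric sketch of the write-voltage expression follows. It assumes the expression reads VQ = (Vdd - VTn) ± sqrt((Vdd - VTn)^2 - (μp/μn)(PR)(Vdd - VTn - |VTp|)^2), that the threshold enters as a magnitude, and that the minus root is the physical one. The defaults Vdd = 2.5 V, VTn = |VTp| = 0.5 V and μp/μn = 0.5 are the values quoted for the plot that follows; the function and parameter names are illustrative.

```python
import math

def write_voltage(pr, vdd=2.5, vtn=0.5, vtp_abs=0.5, mobility_ratio=0.5):
    """Voltage reached on Q during a write, as a function of the pullup ratio PR.

    Assumption: the minus root of the quadratic is the physical solution.
    """
    x = vdd - vtn
    disc = x ** 2 - mobility_ratio * pr * (vdd - vtn - vtp_abs) ** 2
    if disc < 0:
        raise ValueError("no real solution for this pullup ratio")
    return x - math.sqrt(disc)

for pr in (0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4):
    print(f"PR = {pr:.1f}  VQ = {write_voltage(pr):.3f} V")
```

Under these assumptions VQ climbs from about 0.09 V at PR = 0.3 to about 0.86 V at PR = 2.4, so a weaker pull-up relative to the pass transistor makes it easier to pull Q low.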
Write Voltages Ratios
[Figure: plot of write voltage VQ (0 to 1 V) versus Pullup Ratio (PR) from 0.3 to 2.4, for Vdd = 2.5V, |VTp| = 0.5V, and μp/μn = 0.5]
Slide courtesy of Mary Jane Irwin, Penn state
Cell Sizing
Keeping cell size minimized is critical for large caches
Minimum sized pull down fets (M1 and M3)
- Requires minimum width and longer than minimum channel length pass transistors (M5 and M6) to ensure proper CR
- But sizing of the pass transistors increases capacitive load on the word lines and limits the current discharged on the bit lines, both of which can adversely affect the speed of the read cycle
Minimum width and length pass transistors
- Boost the width of the pull downs (M1 and M3)
- Reduces the loading on the word lines and increases the storage capacitance in the cell (both are good!), but cell size may be slightly larger
Slide courtesy of Mary Jane Irwin, Penn state
6T-SRAM Layout
[Figure: 6T SRAM cell layout showing the VDD and GND rails, the word line WL, the two bit lines, and transistors M1 to M6]
Slide courtesy of Mary Jane Irwin, Penn state
1-Transistor DRAM Cell
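[Figure: 1-transistor DRAM cell: access transistor M1, gated by WL, connects the bit line BL (capacitance CBL) to storage node X with capacitor Cs; waveforms for write “1” and read “1” show WL, BL at Vdd, X reaching Vdd-Vt, and Vdd/2 sensing]
Write: Cs is charged (or discharged) by asserting WL and BL
Read: Charge redistribution occurs between CBL and Cs
Read is destructive, so must refresh after read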
Slide courtesy of Mary Jane Irwin, Penn state
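The charge redistribution during a read can be estimated with the usual charge-sharing relation ΔV = (Vcell - Vpre) · Cs / (Cs + CBL). The sketch below assumes the bit line is precharged to Vdd/2 and uses purely hypothetical capacitances (Cs = 30 fF, CBL = 300 fF) together with Vdd = 2.5 V and Vt = 0.5 V; none of these numbers come from the slide.

```python
def bitline_swing(v_cell, v_precharge, c_s, c_bl):
    """Change in bit-line voltage after the cell shares charge with the bit line."""
    return (v_cell - v_precharge) * c_s / (c_s + c_bl)

VDD, VT = 2.5, 0.5            # assumed supply and threshold voltages
C_S, C_BL = 30e-15, 300e-15   # hypothetical storage and bit-line capacitances

swing_one = bitline_swing(VDD - VT, VDD / 2, C_S, C_BL)   # stored "1" is Vdd - Vt
swing_zero = bitline_swing(0.0, VDD / 2, C_S, C_BL)       # stored "0"
print(f"read 1: {swing_one * 1e3:+.1f} mV, read 0: {swing_zero * 1e3:+.1f} mV")
```

With these numbers the bit line moves by only tens of millivolts, which illustrates why a sense amplifier is needed on each bit line and why the stored value has to be rewritten after the read.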
1-T DRAM Cell
[Figure: 1-T DRAM cell, (a) cross-section and (b) layout, showing the diffused bit line, polysilicon gate, polysilicon plate, metal word line, SiO2, field oxide, n+ regions, the M1 word line, the capacitor, and the inversion layer induced by plate bias]
Uses Polysilicon-Diffusion Capacitance
Expensive in Area
Slide courtesy of Mary Jane Irwin, Penn state
DRAM Cell Observations
DRAM memory cells are single ended (complicates the design of the sense amp)
1T cell requires a sense amp for each bit line due to charge redistribution read
1T cell read is destructive; refresh must follow to restore data
1T cell requires an extra capacitor that must be explicitly included in the design
A threshold voltage is lost when writing a 1; this can be circumvented by bootstrapping the word lines to a higher value than Vdd
Not usually available on chip, unless analog elements are present
Review: Basic Building Blocks
Datapath
- Execution units: adder, multiplier, divider, shifter, etc.
- Register file and pipeline registers
Memories
- SRAM cell
- DRAM
- Other types: 1T SRAM
Non-Volatile Memories (Present)
Standard ROM
- Programmed during fabrication
- Diffusion programmable / metal or via programmable options
One Time Programmable (OTP) ROM
- Involves blowing of fuses after fabrication
Erasable Programmable ROM (EPROM)
- Erased through UV light application (programmed electrically)
Electrically Erasable Programmable ROM (EEPROM)
- Programmable by application of high voltage
- Involves two supply voltages, normally not a problem for today’s chips
Future Memory Landscape
Magneto-resistive RAM (~2004)
- IBM, Motorola, Infineon, Nonvolatile Electronics (NVE)
Ferro-electric RAM (FRAM / FeRAM) (~2004)
- Ramtron, Symetrix, Fujitsu, Toshiba, IBM/Infineon, Samsung, Motorola, Hitachi, Matsushita, Micron
Ovonic Unified Memory (OUM) (~2004)
- Ovonyx, Intel, STMicroelectronics, British Aerospace
Nano-Floating Gate memory (>2005)
Single/Few electron memories (SET) (>2007)
Molecular memories (>2010)