Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Datapath component (4)Prof. Usagi
Recap: Memory “hierarchy” in modern processor architectures
2
Processor
DRAM
Storage
SRAM $
Processor Core
Registers
larger
fastest
< 1ns
tens of ns
tens of ns
a few ns
GBs
TBs
32 or 64 words
KBs ~ MBs
L1 $
L2 $
L3 $
fastest
larger
Program-erase cycles: SLC v.s. MLC v.s. TLC v.s. QLC
3
• Regarding the following flash memory characteristics, please identify how many of the following statements are correct ① Flash memory cells can only be programmed with limited times ② The reading latency of flash memory cells can be largely different from
programming ③ The latency of programming different flash memory pages can be different ④ The programmed cell cannot be reprogrammed again unless its charge level is
refilled to the top-level A. 0 B. 1 C. 2 D. 3 E. 4
4
Recap: Flash memory characteristics
• Software designer should be aware of the characteristics of underlying hardware components
5
If programmer doesn’t know flash “features”
• Clock -- Pulsing signal for enabling latches; ticks like a clock • The clock's period must be longer than the longest delay from the state register's output to
the state register's input, known as the critical path. • Synchronous circuit: sequential circuit with a clock • Clock period: time between pulse starts
• Above signal: period = 20 ns • Clock cycle: one such time interval
• Above signal shows 3.5 clock cycles • Clock duty cycle: time clock is high
• 50% in this case • Clock frequency: 1/period
• Above : freq = 1 / 20ns = 50MHz;
6
Recap: Clock signal
0ns 10ns 20ns 30ns 40ns 50ns 60ns 70ns 80ns 90ns
Recap: Serial Adders
7
Full Adder
si
Clk
aibici
ci+1
Excitation Table of Serial Adder
8
ai bi ci ci+1 si
0 0 0 0 00 0 1 0 10 1 0 0 10 1 1 1 01 0 0 0 11 0 1 1 01 1 0 1 01 1 1 1 1
ai
bi
si
D Flip-flopD Q
• Assume each gate delay is 1ns and the delay in a register is 2ns. Which of the following path determines the “cycle time” of the circuit?
A. A B. B C. C D. D
9
Critical path of the circuit?
ai
bi
si
D Flip-flopD Q
A
B
C
D
Poll close in
• Assume each gate delay is 1ns and the delay in a register is 2ns. Which of the following path determines the “cycle time” of the circuit?
A. A B. B C. C D. D
10
Critical path of the circuit?
ai
bi
si
D Flip-flopD Q
A
B
C
D
• Assume each gate delay is 1ns and the delay in a register is 2ns, what’s the cycle time of the circuit?
A. 2 ns B. 3 ns C. 4 ns D. 5 ns E. 6 ns
11
Cycle time of the circuit?
ai
bi
si
D Flip-flopD Q
Poll close in
• Assume each gate delay is 1ns and the delay in a register is 2ns, what’s the cycle time of the circuit?
A. 2 ns B. 3 ns C. 4 ns D. 5 ns E. 6 ns
12
Cycle time of the circuit?
ai
bi
si
D Flip-flopD Q
• Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. Please rank their maximum operating frequencies ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 32-bit serial adders made with 4-bit CLA adders ④ 32-bit serial adders made with 1-bit full adders A. (1) > (2) > (3) > (4) B. (2) > (1) > (4) > (3) C. (2) > (1) > (3) > (4) D. (4) > (3) > (2) > (1) E. (4) > (3) > (1) > (2)
13
Recap: Frequency
117ns = 58.8MHz
164ns
= 15.6MHz1
5ns= 200MHz
14ns = 250MHz
• Consider the following adders? ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 32-bit serial adders made with 4-bit CLA adders ④ 32-bit serial adders made with 1-bit full adders A. Area: (1) > (2) > (3) > (4) Delay: (1) < (2) < (3) < (4) B. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (2) < (4) C. Area: (1) > (3) > (4) > (2) Delay: (1) < (3) < (4) < (2) D. Area: (1) > (2) > (3) > (4) Delay: (1) < (3) < (2) < (4) E. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (4) < (2)
14
Recap: Area/Delay of adders
Each CLA — 2-gate delay — 8*2+1 ~ 17Each carry — 2-gate delay — 64
Each CLA — (3-gate delay + 2-gate delay)*8 cycles — 5*8+1 = 41Each CLA — (2-gate delay + 2-gate delay)*32 cycles — 4*32 = 128
Frequency != End-to-end latency
15
• Pipelining • Multipliers
16
Outline
• Different parts of the hardware works on different requests/commands simultaneously
• A clock signal controls and synchronize the beginning and the end of each part/stage of the work
• A pipeline register between different parts of the hardware to keep intermediate results necessary for the upcoming work • Register is basically an array of flip-flops!
17
Pipelining
Pipelining
18
Pipelining a 4-bit serial adder
19
Serial Adder
# 1
Serial Adder
# 2
Serial Adder
# 3
Serial Adder
# 4
Pipelining a 4-bit serial adder
20
add a, b add c, d add e, f add g, h add i, j add k, l add m, n add o, p add q, r add s, t add u, v
1st 2nd 1st
3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 2nd 1st
4th 3rd 4th 2nd 3rd 4th
t
After this point, we are completing an add operation each cycle!
CyclesAdd
= 1
• Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. And we are processing 10 million of add operations. Please rank their total time in finishing these 10 million adds. ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) < (2) < (3) < (4) B. (2) < (1) < (4) < (3) C. (3) < (4) < (2) < (1) D. (4) < (3) < (2) < (1) E. (4) < (3) < (1) < (2)
21
What if we have millions of adds to do?Poll close in
• Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. And we are processing 10 million of add operations. Please rank their total time in finishing these 10 million adds. ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) < (2) < (3) < (4) B. (2) < (1) < (4) < (3) C. (3) < (4) < (2) < (1) D. (4) < (3) < (2) < (1) E. (4) < (3) < (1) < (2)
22
What if we have millions of adds to do?
• Latency — the amount of time to finish an operation • access time • response time
• Throughput — the amount of work can be done within a given period of time • bandwidth (MB/Sec, GB/Sec, Mbps, Gbps) • IOPs • MFLOPs
23
Latency/Delay v.s. Bandwidth/Throughput
Toyota Prius 100 Gb Network
bandwidth 290GB/sec 100 Gb/s or 12.5GB/sec
latency 3.5 hours 2 Peta-byte over 167772 seconds = 1.94 Days
response time You see nothing in the first 3.5 hours You can start watching the movie as soon as you get a frame!
Latency/Delay v.s. Throughput
24
•100 miles (161 km) from UCSD •75 MPH on highway! •Max load: 374 kg = 2,770 hard drives
(2TB per drive)
•100 miles (161 km) from UCSD •Lightspeed! — 3*108m/sec •Max load:4 lanes operating at 25GHz
• Consider the following adders. Please rank the number of transistors in implementing each of them ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) > (2) > (3) > (4) B. (2) > (1) > (4) > (3) C. (3) > (4) > (2) > (1) D. (4) > (3) > (2) > (1) E. (4) > (3) > (1) > (2)
25
Area/CostPoll close in
• How many transistors do we need to implement a 4-bit CLA logic?
A. 38 B. 64 C. 88 D. 116 E. 128
26
Recap: CLA’s size
C1 = G0 + P0 C0
C2 = G1 + P1 C1
Gi = AiBi
Pi = Ai XOR Bi
C3 = G2 + P2 C2
C4 = G3 + P3 C3
= G1 + P1 (G0 + P0 C0)= G1 + P1G0 + P1P0C0
= G2 + P2 G1 + P2 P1G0 + P2 P1P0C0
= G3 + P3 G2 + P3 P2 G1 + P3 P2 P1G0 + P3 P2 P1P0C0
4 + 4 = 8
4 + 6 + 6 = 16
4 + 6 + 8 + 8 =26
4 + 6 + 8 + 10 + 10 = 38
Si = Ai XOR Bi XOR Ci
Recap: Excitation Table of Serial Adder
27
ai bi ci ci+1 si
0 0 0 0 00 0 1 0 10 1 0 0 10 1 1 1 01 0 0 0 11 0 1 1 01 1 0 1 01 1 1 1 1
ai
bi
si
D Flip-flopD Q
• Consider the following adders. Please rank the number of transistors in implementing each of them ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) > (2) > (3) > (4) B. (2) > (1) > (4) > (3) C. (3) > (4) > (2) > (1) D. (4) > (3) > (2) > (1) E. (4) > (3) > (1) > (2)
28
Area/Cost
— 1952 transistors— 1600 transistors
— (50 transistors )*32 + (2+…+32)*18 transistors= 2127— (244 transistors)*8 + 7+ (8+12+16+20+24+28+32)*18 transistors= 4479
— pipelining needs to “duplicate” serial units and use more area
• The number of transistors we can build in a fixed area of silicon doubles every 12 ~ 24 months.
29
Moore’s Law
(1) Moore, G. E. (1965), 'Cramming more components onto integrated circuits', Electronics 38 (8) .
(1)Tr
ansis
tor C
ount
110
1001,000
10,000100,000
1,000,00010,000,000
100,000,0001,000,000,000
10,000,000,000
1970 1975 1980 1985 1990 1995 2000 2005 2010 2015
Moore’s Law is the most important driver for
historic CPU performance gains
— pipelining needs to “duplicate” serial units and use more area — this is why —
Multiplier
30
• Thinking about how you do this by hand in decimal!
31
Binary multiplication
1 2 3 4× 5 6 7 8
9 8 7 28 6 3 8
7 4 0 46 1 7 07 0 0 6 6 5 2
0 1 1 1× 1 1 0 00 0 0 0
0 0 0 00 1 1 1
0 1 1 11 0 1 0 1 0 0
a3 a2 a1 a0× b3 b2 b1 b0a3b0 a2b0 a1b0 a0b0
a3b1 a2b1 a1b1 a0b1 0a3b2 a2b2 a1b2 a0b2 0 0
a3b3 a2b3 a1b3 a0b3 0 0 0p7 p6 p5 p4 p3 p2 p1 p0
pp1
pp2
pp3
pp4
Shifters
32
Recap: Shift “Right”
33
shamt2
11 10 01 00
MUX11 10 01 00
MUX11 10 01 00
MUX11 10 01 00
MUX
Y0Y1Y2Y3
Based on the value of the selection input (shamt = shift amount)
The “chain” of multiplexers determines how many bits to shift
A3 A2 A1 A00 Example: if S = 01 then Y3 = 0 Y2 = A3 Y1 = A2 Y0 = A1
Example: if S = 10 then Y3 = 0 Y2 = 0 Y1 = A3 Y0 = A2
Example: if S = 11 then Y3 = 0 Y2 = 0 Y1 = 0 Y0 = A3
• Refer to the shift right logic, what do we need to modify to perform shift left?
A. We can alter the interpretation of shamt to support shift left
B. We don’t need to modify the circuit, just take a not on every input
C. We don’t need to modify the circuit, just change the order of inputs
D. We don’t need to modify the circuit, just change the order of outputs
E. None of the above34
How to support shift left?Poll close in
• Refer to the shift right logic, what do we need to modify to perform shift left?
A. We can alter the interpretation of shamt to support shift left
B. We don’t need to modify the circuit, just take a not on every input
C. We don’t need to modify the circuit, just change the order of inputs
D. We don’t need to modify the circuit, just change the order of outputs
E. None of the above35
How to support shift left?
Shift “Left”
36
Example: if S = 01 then Y3 = A2 Y2 = A1 Y1 = A0 Y0 = 0
Example: if S = 10 then Y3 = A1 Y2 = A0 Y1 = 0Y0 = 0
Example: if S = 11 then Y3 = A0 Y2 = 0 Y1 = 0Y0 = 0
shamt2
11 10 01 00
MUX11 10 01 00
MUX11 10 01 00
MUX11 10 01 00
MUX
Y0Y1Y2Y3
0 A0 A1 A2 A3
Generic Shifter
37
shamt2
11 10 01 00
MUX11 10 01 00
MUX11 10 01 00
MUX11 10 01 00
MUX
Y0Y1Y2Y3
01 0
MUX1 0
MUX1 0
MUX1 0
MUX
A3 A2 A1 A0
SHL?
Let’s get back on Multiplier
38
Shift and add
39
B0
0 0 0 0 A3A2A1A0
8-bit Shifter SHL = 1
8-bit Adder
0 1 0
MUX8
80
8
8
1 0
MUX B1
0
8-bit Adder
8-bit Shifter SHL = 18
1 0
MUX B2
0
8-bit Adder
8-bit Shifter SHL = 18
1 0
MUX B3
0
+5
+2
+2
+4
+5
+2
+4
+5
+2
+4 +5— 40 gate delays
Array style
40
b0
b1
b2
b3
a0 a1 a2 a3
5-bit adder
0 0
6-bit adder
00 0
0
7-bit adder
000
p7 p6 p5 p4 p3 p2 p1 p0
• What’s the estimated gate-delay of the 4-bit multiplier? (Assume adders are composed of 4-bit CLAs)
A. 9 B. 12 C. 13 D. 15 E. 16
41
Gate-delays of Array-style MultipliersPoll close in
• What’s the estimated gate-delay of the 4-bit multiplier? (Assume adders are composed of 4-bit CLAs)
A. 9 B. 12 C. 13 D. 15 E. 16
42
Gate-delays of Array-style Multipliers
+1
+5
+5
+5
• What’s the estimated gate-delay of a 32-bit multiplier? (Assume adders are composed of 4-bit CLAs)
A. 0 — 100 B. 100 — 500 C. 500 — 1000 D. 1000 — 1500 E. > 1500
43
Gate-delays of 32-bit array-style multipliersPoll close in
• What’s the estimated gate-delay of a 32-bit multiplier? (Assume adders are composed of 4-bit CLAs)
A. 0 — 100 B. 100 — 500 C. 500 — 1000 D. 1000 — 1500 E. > 1500
44
Gate-delays of 32-bit array-style multipliers
We need 33-64 bit adders33 - 36 -bit adders —> (9*2+1) gate delays *4
37 - 40 -bit adders —> (10*2+1) gate delays *441 - 44 -bit adders —> (11*2+1) gate delays *4
45 - 48 -bit adders —> (12*2+1) gate delays *449 - 52 -bit adders —> (13*2+1) gate delays *453 - 56 -bit adders —> (14*2+1) gate delays *457 - 60 -bit adders —> (15*2+1) gate delays *461 - 64 -bit adders —> (16*2+1) gate delays *4
4*2*(9+10+11+12+13+14+15+16+1) = 808
Each n-bit adder is roundup(n/4)*2+1
• Lab 5 due tonight • Lab 6 is up — due on 6/2
• Watch the video and read the instruction BEFORE your session • There are links on both course webpage and iLearn lab section • Submit through iLearn > Labs
• Office Hours • All office hours share the same meeting instance — if you have registered once, you
cannot do it again. • Zoom does not resend registration confirmation and does not allow us to “re-approve” if
you have registered • The only way is to dig out the e-mail from Zoom
• Last reading quiz due next Tuesday • Check your grades in iLearn
45
Announcement
つづく
ElectricalComputerEngineering
Science 120A