15 DatapathComponents 4 - intra.engr.ucr.eduhtseng/classes/ee120a_2020sp/slides/15... · •Clock -- Pulsing signal for enabling latches; ticks like a clock • The clock's period

Datapath component (4)Prof. Usagi

Recap: Memory “hierarchy” in modern processor architectures

2

Processor

DRAM

Storage

SRAM $

Processor Core

Registers

larger

fastest

< 1ns

tens of ns

tens of ns

a few ns

GBs

TBs

32 or 64 words

KBs ~ MBs

L1 $

L2 $

L3 $

fastest

larger

Program-erase cycles: SLC v.s. MLC v.s. TLC v.s. QLC

3

• Regarding the following flash memory characteristics, please identify how many of the following statements are correct ① Flash memory cells can only be programmed with limited times ② The reading latency of flash memory cells can be largely different from

programming ③ The latency of programming different flash memory pages can be different ④ The programmed cell cannot be reprogrammed again unless its charge level is

refilled to the top-level A. 0 B. 1 C. 2 D. 3 E. 4

4

Recap: Flash memory characteristics

• Software designer should be aware of the characteristics of underlying hardware components

5

If programmer doesn’t know flash “features”

• Clock -- Pulsing signal for enabling latches; ticks like a clock • The clock's period must be longer than the longest delay from the state register's output to

the state register's input, known as the critical path. • Synchronous circuit: sequential circuit with a clock • Clock period: time between pulse starts

• Above signal: period = 20 ns • Clock cycle: one such time interval

• Above signal shows 3.5 clock cycles • Clock duty cycle: time clock is high

• 50% in this case • Clock frequency: 1/period

• Above : freq = 1 / 20ns = 50MHz;

6

Recap: Clock signal

0ns 10ns 20ns 30ns 40ns 50ns 60ns 70ns 80ns 90ns

Recap: Serial Adders

7

Full Adder

si

Clk

aibici

ci+1

Excitation Table of Serial Adder

8

ai bi ci ci+1 si

0 0 0 0 00 0 1 0 10 1 0 0 10 1 1 1 01 0 0 0 11 0 1 1 01 1 0 1 01 1 1 1 1

ai

bi

si

D Flip-flopD Q

• Assume each gate delay is 1ns and the delay in a register is 2ns. Which of the following path determines the “cycle time” of the circuit?

A. A B. B C. C D. D

9

Critical path of the circuit?

ai

bi

si

D Flip-flopD Q

A

B

C

D

Poll close in

• Assume each gate delay is 1ns and the delay in a register is 2ns. Which of the following path determines the “cycle time” of the circuit?

A. A B. B C. C D. D

10

Critical path of the circuit?

ai

bi

si

D Flip-flopD Q

A

B

C

D

• Assume each gate delay is 1ns and the delay in a register is 2ns, what’s the cycle time of the circuit?

A. 2 ns B. 3 ns C. 4 ns D. 5 ns E. 6 ns

11

Cycle time of the circuit?

ai

bi

si

D Flip-flopD Q

Poll close in

• Assume each gate delay is 1ns and the delay in a register is 2ns, what’s the cycle time of the circuit?

A. 2 ns B. 3 ns C. 4 ns D. 5 ns E. 6 ns

12

Cycle time of the circuit?

ai

bi

si

D Flip-flopD Q

• Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. Please rank their maximum operating frequencies ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 32-bit serial adders made with 4-bit CLA adders ④ 32-bit serial adders made with 1-bit full adders A. (1) > (2) > (3) > (4) B. (2) > (1) > (4) > (3) C. (2) > (1) > (3) > (4) D. (4) > (3) > (2) > (1) E. (4) > (3) > (1) > (2)

13

Recap: Frequency

117ns = 58.8MHz

164ns

= 15.6MHz1

5ns= 200MHz

14ns = 250MHz

• Consider the following adders? ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 32-bit serial adders made with 4-bit CLA adders ④ 32-bit serial adders made with 1-bit full adders A. Area: (1) > (2) > (3) > (4) Delay: (1) < (2) < (3) < (4) B. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (2) < (4) C. Area: (1) > (3) > (4) > (2) Delay: (1) < (3) < (4) < (2) D. Area: (1) > (2) > (3) > (4) Delay: (1) < (3) < (2) < (4) E. Area: (1) > (3) > (2) > (4) Delay: (1) < (3) < (4) < (2)

14

Recap: Area/Delay of adders

Each CLA — 2-gate delay — 8*2+1 ~ 17Each carry — 2-gate delay — 64

Each CLA — (3-gate delay + 2-gate delay)*8 cycles — 5*8+1 = 41Each CLA — (2-gate delay + 2-gate delay)*32 cycles — 4*32 = 128

Frequency != End-to-end latency

15

• Pipelining • Multipliers

16

Outline

• Different parts of the hardware works on different requests/commands simultaneously

• A clock signal controls and synchronize the beginning and the end of each part/stage of the work

• A pipeline register between different parts of the hardware to keep intermediate results necessary for the upcoming work • Register is basically an array of flip-flops!

17

Pipelining

Pipelining

18

Pipelining a 4-bit serial adder

19

Serial Adder

# 1

Serial Adder

# 2

Serial Adder

# 3

Serial Adder

# 4

Pipelining a 4-bit serial adder

20

add a, b add c, d add e, f add g, h add i, j add k, l add m, n add o, p add q, r add s, t add u, v

1st 2nd 1st

3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 2nd 1st

4th 3rd 4th 2nd 3rd 4th

t

After this point, we are completing an add operation each cycle!

CyclesAdd

= 1

• Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. And we are processing 10 million of add operations. Please rank their total time in finishing these 10 million adds. ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) < (2) < (3) < (4) B. (2) < (1) < (4) < (3) C. (3) < (4) < (2) < (1) D. (4) < (3) < (2) < (1) E. (4) < (3) < (1) < (2)

21

What if we have millions of adds to do?Poll close in

• Consider the following adders. Assume each gate delay is 1ns and the delay in a register is 2ns. And we are processing 10 million of add operations. Please rank their total time in finishing these 10 million adds. ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) < (2) < (3) < (4) B. (2) < (1) < (4) < (3) C. (3) < (4) < (2) < (1) D. (4) < (3) < (2) < (1) E. (4) < (3) < (1) < (2)

22

What if we have millions of adds to do?

• Latency — the amount of time to finish an operation • access time • response time

• Throughput — the amount of work can be done within a given period of time • bandwidth (MB/Sec, GB/Sec, Mbps, Gbps) • IOPs • MFLOPs

23

Latency/Delay v.s. Bandwidth/Throughput

Toyota Prius 100 Gb Network

bandwidth 290GB/sec 100 Gb/s or 12.5GB/sec

latency 3.5 hours 2 Peta-byte over 167772 seconds = 1.94 Days

response time You see nothing in the first 3.5 hours You can start watching the movie as soon as you get a frame!

Latency/Delay v.s. Throughput

24

•100 miles (161 km) from UCSD •75 MPH on highway! •Max load: 374 kg = 2,770 hard drives

(2TB per drive)

•100 miles (161 km) from UCSD •Lightspeed! — 3*108m/sec •Max load:4 lanes operating at 25GHz

• Consider the following adders. Please rank the number of transistors in implementing each of them ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) > (2) > (3) > (4) B. (2) > (1) > (4) > (3) C. (3) > (4) > (2) > (1) D. (4) > (3) > (2) > (1) E. (4) > (3) > (1) > (2)

25

Area/CostPoll close in

• How many transistors do we need to implement a 4-bit CLA logic?

A. 38 B. 64 C. 88 D. 116 E. 128

26

Recap: CLA’s size

C1 = G0 + P0 C0

C2 = G1 + P1 C1

Gi = AiBi

Pi = Ai XOR Bi

C3 = G2 + P2 C2

C4 = G3 + P3 C3

= G1 + P1 (G0 + P0 C0)= G1 + P1G0 + P1P0C0

= G2 + P2 G1 + P2 P1G0 + P2 P1P0C0

= G3 + P3 G2 + P3 P2 G1 + P3 P2 P1G0 + P3 P2 P1P0C0

4 + 4 = 8

4 + 6 + 6 = 16

4 + 6 + 8 + 8 =26

4 + 6 + 8 + 10 + 10 = 38

Si = Ai XOR Bi XOR Ci

Recap: Excitation Table of Serial Adder

27

ai bi ci ci+1 si

0 0 0 0 00 0 1 0 10 1 0 0 10 1 1 1 01 0 0 0 11 0 1 1 01 1 0 1 01 1 1 1 1

ai

bi

si

D Flip-flopD Q

• Consider the following adders. Please rank the number of transistors in implementing each of them ① 32-bit CLA made with 8 4-bit CLA adders ② 32-bit CRA made with 32 full adders ③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders ④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders A. (1) > (2) > (3) > (4) B. (2) > (1) > (4) > (3) C. (3) > (4) > (2) > (1) D. (4) > (3) > (2) > (1) E. (4) > (3) > (1) > (2)

28

Area/Cost

— 1952 transistors— 1600 transistors

— (50 transistors )*32 + (2+…+32)*18 transistors= 2127— (244 transistors)*8 + 7+ (8+12+16+20+24+28+32)*18 transistors= 4479

— pipelining needs to “duplicate” serial units and use more area

• The number of transistors we can build in a fixed area of silicon doubles every 12 ~ 24 months.

29

Moore’s Law

(1) Moore, G. E. (1965), 'Cramming more components onto integrated circuits', Electronics 38 (8) .

(1)Tr

ansis

tor C

ount

110

1001,000

10,000100,000

1,000,00010,000,000

100,000,0001,000,000,000

10,000,000,000

1970 1975 1980 1985 1990 1995 2000 2005 2010 2015

Moore’s Law is the most important driver for

historic CPU performance gains

— pipelining needs to “duplicate” serial units and use more area — this is why —

Multiplier

30

• Thinking about how you do this by hand in decimal!

31

Binary multiplication

1 2 3 4× 5 6 7 8

9 8 7 28 6 3 8

7 4 0 46 1 7 07 0 0 6 6 5 2

0 1 1 1× 1 1 0 00 0 0 0

0 0 0 00 1 1 1

0 1 1 11 0 1 0 1 0 0

a3 a2 a1 a0× b3 b2 b1 b0a3b0 a2b0 a1b0 a0b0

a3b1 a2b1 a1b1 a0b1 0a3b2 a2b2 a1b2 a0b2 0 0

a3b3 a2b3 a1b3 a0b3 0 0 0p7 p6 p5 p4 p3 p2 p1 p0

pp1

pp2

pp3

pp4

Shifters

32

Recap: Shift “Right”

33

shamt2

11 10 01 00

MUX11 10 01 00

MUX11 10 01 00

MUX11 10 01 00

MUX

Y0Y1Y2Y3

Based on the value of the selection input (shamt = shift amount)

The “chain” of multiplexers determines how many bits to shift

A3 A2 A1 A00 Example: if S = 01 then Y3 = 0 Y2 = A3 Y1 = A2 Y0 = A1

Example: if S = 10 then Y3 = 0 Y2 = 0 Y1 = A3 Y0 = A2

Example: if S = 11 then Y3 = 0 Y2 = 0 Y1 = 0 Y0 = A3

• Refer to the shift right logic, what do we need to modify to perform shift left?

A. We can alter the interpretation of shamt to support shift left

B. We don’t need to modify the circuit, just take a not on every input

C. We don’t need to modify the circuit, just change the order of inputs

D. We don’t need to modify the circuit, just change the order of outputs

E. None of the above34

How to support shift left?Poll close in

• Refer to the shift right logic, what do we need to modify to perform shift left?

A. We can alter the interpretation of shamt to support shift left

B. We don’t need to modify the circuit, just take a not on every input

C. We don’t need to modify the circuit, just change the order of inputs

D. We don’t need to modify the circuit, just change the order of outputs

E. None of the above35

How to support shift left?

Shift “Left”

36

Example: if S = 01 then Y3 = A2 Y2 = A1 Y1 = A0 Y0 = 0

Example: if S = 10 then Y3 = A1 Y2 = A0 Y1 = 0Y0 = 0

Example: if S = 11 then Y3 = A0 Y2 = 0 Y1 = 0Y0 = 0

shamt2

11 10 01 00

MUX11 10 01 00

MUX11 10 01 00

MUX11 10 01 00

MUX

Y0Y1Y2Y3

0 A0 A1 A2 A3

Generic Shifter

37

shamt2

11 10 01 00

MUX11 10 01 00

MUX11 10 01 00

MUX11 10 01 00

MUX

Y0Y1Y2Y3

01 0

MUX1 0

MUX1 0

MUX1 0

MUX

A3 A2 A1 A0

SHL?

Let’s get back on Multiplier

38

Shift and add

39

B0

0 0 0 0 A3A2A1A0

8-bit Shifter SHL = 1

8-bit Adder

0 1 0

MUX8

80

8

8

1 0

MUX B1

0

8-bit Adder


1 0

MUX B2

0

8-bit Adder


1 0

MUX B3

0

+5

+2

+2

+4

+5

+2

+4

+5

+2

+4 +5— 40 gate delays

Array style

40

b0

b1

b2

b3

a0 a1 a2 a3

5-bit adder

0 0

6-bit adder

00 0

0

7-bit adder

000

p7 p6 p5 p4 p3 p2 p1 p0

• What’s the estimated gate-delay of the 4-bit multiplier? (Assume adders are composed of 4-bit CLAs)

A. 9 B. 12 C. 13 D. 15 E. 16

41

Gate-delays of Array-style MultipliersPoll close in

• What’s the estimated gate-delay of the 4-bit multiplier? (Assume adders are composed of 4-bit CLAs)

A. 9 B. 12 C. 13 D. 15 E. 16

42

Gate-delays of Array-style Multipliers

+1

+5

+5

+5

• What’s the estimated gate-delay of a 32-bit multiplier? (Assume adders are composed of 4-bit CLAs)

A. 0 — 100 B. 100 — 500 C. 500 — 1000 D. 1000 — 1500 E. > 1500

43

Gate-delays of 32-bit array-style multipliersPoll close in

• What’s the estimated gate-delay of a 32-bit multiplier? (Assume adders are composed of 4-bit CLAs)

A. 0 — 100 B. 100 — 500 C. 500 — 1000 D. 1000 — 1500 E. > 1500

44

Gate-delays of 32-bit array-style multipliers

We need 33-64 bit adders33 - 36 -bit adders —> (9*2+1) gate delays *4

37 - 40 -bit adders —> (10*2+1) gate delays *441 - 44 -bit adders —> (11*2+1) gate delays *4

45 - 48 -bit adders —> (12*2+1) gate delays *449 - 52 -bit adders —> (13*2+1) gate delays *453 - 56 -bit adders —> (14*2+1) gate delays *457 - 60 -bit adders —> (15*2+1) gate delays *461 - 64 -bit adders —> (16*2+1) gate delays *4

4*2*(9+10+11+12+13+14+15+16+1) = 808

Each n-bit adder is roundup(n/4)*2+1

• Lab 5 due tonight • Lab 6 is up — due on 6/2

• Watch the video and read the instruction BEFORE your session • There are links on both course webpage and iLearn lab section • Submit through iLearn > Labs

• Office Hours • All office hours share the same meeting instance — if you have registered once, you

cannot do it again. • Zoom does not resend registration confirmation and does not allow us to “re-approve” if

you have registered • The only way is to dig out the e-mail from Zoom

• Last reading quiz due next Tuesday • Check your grades in iLearn

45

Announcement

つづく

ElectricalComputerEngineering

Science 120A

Documents

15 DatapathComponents 4 - intra.engr.ucr.eduhtseng/classes/ee120a_2020sp/slides/15... · •Clock -- Pulsing signal for enabling latches; ticks like a clock • The clock's period