Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Pipelining
1
Changelog
Changes made in this version not seen in first lecture:5 October 2017: put addq timing slide before critical path slides5 October 2017: slide 25: arrows to fetch/decode registers point tooutputs, not inputs5 October 2017: slide 27: sum with no pipelining was 550 ps, not 500 ps5 October 2017: slide 33: e_dstE and W_dstE labels were swapped5 October 2017: slide 34: rA should have been D_rA, e_valA shouldhave been d_valA
1
logistics
exam graded — scores on gradebookkeys on Collabview exam, submit regrade requests via TPEGSnote: two questions dropped outside of TPEGS
median: 83.5; 25th percentile: 77.3; 75th percentile: 90.7
HCL homework due next Wednesday
2
Human pipeline: laundry
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
3
Human pipeline: laundry
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
3
Waste (1)
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
wasted time!wasted time!
4
Waste (1)
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
wasted time!wasted time!
4
Waste (2)
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
5
Latency — Time for One
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
pipelined latency (2.1 h)
colors colors colors
normal latency (1.8 h)
6
Latency — Time for One
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
pipelined latency (2.1 h)
colors colors colors
normal latency (1.8 h)
6
Latency — Time for One
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
pipelined latency (2.1 h)
colors colors colors
normal latency (1.8 h)
6
Throughput — Rate of Many
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
time between finishes (0.83 h)
1 load0.83h = 1.2 loads/h
time between starts (0.83 h)
7
Throughput — Rate of Many
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
time between finishes (0.83 h)
1 load0.83h = 1.2 loads/h
time between starts (0.83 h)
7
Throughput — Rate of Many
Washer
Dryer
FoldingTable
11:00 12:00 13:00 14:00
whites
whites
whites
colors
colors
colors
sheets
sheets
sheets
time between finishes (0.83 h)
1 load0.83h = 1.2 loads/h
time between starts (0.83 h)
7
times three circuit
AADDADD
ADDADD
2 × A
3 × A
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
21
0 ps 50 ps 100 ps100 ps latency =⇒ 10 results/ns throughput
8
times three circuit
AADDADD
ADDADD
2 × A
3 × A
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
21
0 ps 50 ps 100 ps
100 ps latency =⇒ 10 results/ns throughput
8
times three circuit
AADDADD
ADDADD
2 × A
3 × A
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
21
0 ps 50 ps 100 ps100 ps latency =⇒ 10 results/ns throughput
8
times three and repeat
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
17
34
4
8
1
2
23
46
0 ps 100 ps 200 ps 300 ps 400 ps 500 ps
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
21
17
34
51
4
8
12
1
2
3
23
46
69
0 ps 100 ps 200 ps 300 ps 400 ps 500 ps
9
times three and repeat
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
17
34
4
8
1
2
23
46
0 ps 100 ps 200 ps 300 ps 400 ps 500 ps
Aadd A + A
2 × Aadd 2A + A
3 × A
7
14
21
17
34
51
4
8
12
1
2
3
23
46
69
0 ps 100 ps 200 ps 300 ps 400 ps 500 ps
9
pipelined times three
A (t + 2)
ADDADD
ADDADD2 × A (t + 1)
A (t + 1) 3 × A (t + 0)
A (t + 2)A (t + 1)
2 × A (t + 1)3 × A (t + 0)
7714
21
171734
10
pipelined times three
A (t + 2)
ADDADD
ADDADD2 × A (t + 1)
A (t + 1) 3 × A (t + 0)
A (t + 2)A (t + 1)
2 × A (t + 1)3 × A (t + 0)
7714
21
171734
10
register tolerances
register outputregister input
outputchanges
inputmust notchange
register delay
11
register tolerances
register outputregister input
outputchanges
inputmust notchange
register delay
11
register tolerances
register outputregister input
outputchanges
inputmust notchange
register delay
11
times three pipeline timing
A (t + 2)
ADDADD
ADDADD2 × A (t + 1)
A (t + 1) 3 × A (t + 0)
10 ps 50 ps 10 ps 50 ps 10 ps
throughput: 160 ps ≈ 16 G operations/sec
12
times three pipeline timing
A (t + 2)
ADDADD
ADDADD2 × A (t + 1)
A (t + 1) 3 × A (t + 0)
10 ps 50 ps 10 ps 50 ps 10 ps
throughput: 160 ps ≈ 16 G operations/sec
12
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
13
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
13
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?
throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
13
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?
throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
13
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
14
diminishing returns: register delays
logic (all)
100 ps
110 psper cycle
10 ps
logic (1/2)
50 ps
60 psper cycle
10 ps
logic (2/2)
50 ps 10 ps
logic (1/3)
33 ps
43 psper cycle
10 ps
logic (2/3)
33 ps 10 ps
logic (3/3)
33 ps 10 ps... ... ... ...
1 ps
11 psper cycle
10 ps 1 ps 10 ps 1 ps 10 ps 1 ps 10 ps
…
15
diminishing returns: register delays
2 4 6 8 10 12 140
20
40
60
80
100
120
number of stages
time
perc
ompl
etio
n(p
s)
16
diminishing returns: register delays
2 4 6 8 10 12 140
20
40
60
80
100
120
register delay
number of stages
time
perc
ompl
etio
n(p
s)
16
diminishing returns: register delays
2 4 6 8 10 12 140
20
40
60
80
100
120
register delay
1.83x speedup
1.02x speedup
number of stages
time
perc
ompl
etio
n(p
s)
16
diminishing returns: register delays
2 4 6 8 10 12 140
20
40
60
80
100
1.83x throughput
1.02x throughput
number of stages
thro
ughp
ut(o
ps/n
s)
17
diminishing returns: register delays
2 4 6 8 10 12 140
20
40
60
80
100
1.83x throughput
1.02x throughput
max. rate of register updates
number of stages
thro
ughp
ut(o
ps/n
s)
17
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
18
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)
throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
19
deeper pipeline
A (t + 4)
ADDADD
ADDADD2 × A (t + 2)
A (t + 2) 3 × A (t + 0)
10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps
exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec
30 ps 20 ps
exercise: throughput now? (didn’t split second add evenly)
throughput: 140 ps ≈ 25 G ops/sec
A (t + 3)
2 × Apartial results
3 × Apartial results
Problem: How much faster can we get?
Problem: Can we even do this?
19
diminishing returns: uneven split
Can we split up some logic (e.g. adder) arbitrarily?Probably not...
logic (all)
100 ps
110 psper cycle
10 ps
logic (1/2)
60 ps
70 psper cycle
10 ps
logic (2/2)
45 ps 10 ps
logic(1/3)
40 ps
50 psper cycle
10 ps
logic(2/3)
40 ps 10 ps
logic(3/3)
30 ps 10 ps... ... ... ... 20
diminishing returns: uneven split
Can we split up some logic (e.g. adder) arbitrarily?Probably not...
logic (all)
100 ps
110 psper cycle
10 ps
logic (1/2)
60 ps
70 psper cycle
10 ps
logic (2/2)
45 ps 10 ps
logic(1/3)
40 ps
50 psper cycle
10 ps
logic(2/3)
40 ps 10 ps
logic(3/3)
30 ps 10 ps... ... ... ... 20
diminishing returns: uneven split
Can we split up some logic (e.g. adder) arbitrarily?Probably not...
logic (all)
100 ps
110 psper cycle
10 ps
logic (1/2)
60 ps
70 psper cycle
10 ps
logic (2/2)
45 ps 10 ps
logic(1/3)
40 ps
50 psper cycle
10 ps
logic(2/3)
40 ps 10 ps
logic(3/3)
30 ps 10 ps... ... ... ... 20
textbook SEQ ‘stages’
conceptual order only
Fetch: read instruction memory
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
PC Update: write PC register
writes happenat end of cycle
reads — “magic”like combinatorial logicas values available
21
textbook SEQ ‘stages’
conceptual order only
Fetch: read instruction memory
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
PC Update: write PC register
writes happenat end of cycle
reads — “magic”like combinatorial logicas values available
21
textbook SEQ ‘stages’
conceptual order only
Fetch: read instruction memory
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
PC Update: write PC register
writes happenat end of cycle
reads — “magic”like combinatorial logicas values available
21
textbook stages
conceptual order only pipeline stages
Fetch/PC Update: read instruction memory;compute next PC
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
5 stagesone instruction in eachcompute next to start immediatelly
22
textbook stages
conceptual order only pipeline stages
Fetch/PC Update: read instruction memory;compute next PC
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
5 stagesone instruction in eachcompute next to start immediatelly
22
addq CPU
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
decode execute
writeback
signal skips two stages
fetch andPC update
23
addq CPU
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
decode execute
writeback
signal skips two stages
fetch andPC update
23
addq CPU
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
decode execute
writeback
signal skips two stages
fetch andPC update
23
addq CPU
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
decode execute
writeback
signal skips two stages
fetch andPC update
23
pipelined addq processor
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2fetch andPC update
decode execute
writeback
fetch/decode
decode/execute
execute/writeback
fetch/fetch
24
pipelined addq processor
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch andPC update
decode execute
writeback
fetch/decode
decode/execute
execute/writeback
fetch/fetch
24
pipelined addq processor
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch andPC update
decode execute
writeback
fetch/decode
decode/execute
execute/writeback
fetch/fetch
24
pipelined addq processor
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch andPC update
decode execute
writeback
fetch/decode
decode/execute
execute/writeback
fetch/fetch
24
addq execution
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch/decode
decode/execute
execute/writeback
fetch/fetch
addq %r8, %r9 // (1)addq %r10, %r11 // (2)
addq %r8, %r9 //(1)
address of (2)
addq %r10, %r11 //(2)
reg #s 8, 9 from (1)reg #s 10, 11 from (2)reg # 9,values for (1)
25
addq execution
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch/decode
decode/execute
execute/writeback
fetch/fetch
addq %r8, %r9 // (1)addq %r10, %r11 // (2)
addq %r8, %r9 //(1)
address of (2)
addq %r10, %r11 //(2)
reg #s 8, 9 from (1)reg #s 10, 11 from (2)reg # 9,values for (1)
25
addq execution
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch/decode
decode/execute
execute/writeback
fetch/fetch
addq %r8, %r9 // (1)addq %r10, %r11 // (2)
addq %r8, %r9 //(1)
address of (2)
addq %r10, %r11 //(2)
reg #s 8, 9 from (1)
reg #s 10, 11 from (2)reg # 9,values for (1)
25
addq execution
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
fetch/decode
decode/execute
execute/writeback
fetch/fetch
addq %r8, %r9 // (1)addq %r10, %r11 // (2)
addq %r8, %r9 //(1)
address of (2)
addq %r10, %r11 //(2)
reg #s 8, 9 from (1)
reg #s 10, 11 from (2)reg # 9,values for (1)
25
addq processor timing
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8
fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8
fetch/decode decode/execute execute/writeback
26
addq processor timing
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8
fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8
fetch/decode decode/execute execute/writeback
26
addq processor timing
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8
fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8
fetch/decode decode/execute execute/writeback
26
addq processor timing
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8
fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8
fetch/decode decode/execute execute/writeback
26
addq processor timing
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8
fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8
fetch/decode decode/execute execute/writeback
26
addq processor performance
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
example delays:path timeadd 2 80 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps
no pipelining: 1 instruction per 550 psadd up everything but add 2 (critical (slowest) path)
pipelining: 1 instruction per 200 ps + pipeline register delaysslowest path through stage + pipeline register delayslatency: 800 ps + pipeline register delays (4 cycles)
27
critical path
every path from state output to state input needs enough timeoutput — may change on rising edge of clockinput — must be stable sufficiently before rising edge of clock
critical path: slowest of all these paths — determines cycle timetimes three: slowest part of ALU ended up mattering
matters with or without pipelining
28
SEQ paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …
…and many, many more paths
29
SEQ paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
path 1: 25 picoseconds
path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …
…and many, many more paths
29
SEQ paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
path 1: 25 picoseconds path 2: 50 picoseconds
path 3: 400 picoseconds path 4: 900 picoseconds… …
…and many, many more paths
29
SEQ paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds
path 4: 900 picoseconds… …
…and many, many more paths
29
SEQ paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …
…and many, many more paths
29
SEQ paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …
…and many, many more paths
29
sequential addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)
30
sequential addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path 1: 25 picoseconds
path 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)
30
sequential addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path 1: 25 picosecondspath 2: 375 picoseconds
path 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)
30
sequential addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picoseconds
path 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)
30
sequential addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picoseconds
overall cycle time: 500 picoseconds (longest path)
30
sequential addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)
30
pipelined addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2 path 1: 80 picosecondspath 2: 210 picosecondspath 3: 210 picosecondspath 4: 135 picosecondspath 5: 110 picosecondspath 6: 135 picoseconds…overall cycle time: 210 picoseconds
31
pipelined addq paths
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2 path 1: 80 picosecondspath 2: 210 picosecondspath 3: 210 picosecondspath 4: 135 picosecondspath 5: 110 picosecondspath 6: 135 picoseconds…overall cycle time: 210 picoseconds
31
exercise
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path timeadd 2 50 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps
pipeline register delay: 10ps
how will throughput improve if we double the speed of theinstruction memory?A. 2.00x B. 1.70x to 1.99xC. 1.60x to 1.69x D. 1.50x to 1.59xE. less than 1.50x
32
exercise
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
path timeadd 2 50 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps
pipeline register delay: 10ps
how will throughput improve if we double the speed of theinstruction memory?A. 2.00x B. 1.70x to 1.99xC. 1.60x to 1.69x D. 1.50x to 1.59xE. less than 1.50x
1135
÷ 1210
= 1.56x — D
32
pipeline register naming convention
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
split
0xF
ADDADD
add 2
f_rA D_rA
d_dstE E_dstE
e_dstEW_dstE33
pipeline register naming convention
f — fetch sends values here
D — decode receives values here
d — decode sends values here
…
34
addq HCL
.../* f: from fetch */f_rA = i10bytes[12..16];f_rB = i10bytes[12..16];
/* fetch to decode *//* f_rA -> D_rA, etc. */register fD {
rA : 4 = REG_NONE;rB : 4 = REG_NONE;
}
/* D: to decoded: from decode */
d_dstE = D_rB;/* use register file: */reg_srcA = D_rA;d_valA = reg_outputA;...
/* decode to execute */register dE {
dstE : 4 = REG_NONE;valA : 64 = 0;valB : 64 = 0;
}
...35
SEQ without stages
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
36
SEQ with stages
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
rule: signal to next stage (except flow control)
37
SEQ with stages
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
rule: signal to next stage (except flow control)
37
SEQ with stages
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
rule: signal to next stage (except flow control)
37
SEQ with stages (actually sequential)
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback38
adding pipeline registers
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
not shown — control logic
39
adding pipeline registers
PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
not shown — control logic
39