1
ECE 587 Advanced Computer Architecture
Chapter 6
Pipelining
Herbert G. Mayer, PSU
Status 6/27/2015
2
Syllabus
- High-Level View of Pipelining
- Idealized Pipeline
- Definitions
- Goal of Pipelining
- Causes for Dependences
- Stalls and Hazards
- Realistic Constraints
- Reservation Tables in Pipelining
- Collision Vector
- Vertical Expansion
- Horizontal Expansion
- IBM Measurements
3
High-Level View of Pipelining
- HW pipelining is an old architectural design method, dating back to the 1970s, for accelerated processor execution
- Pipelining improves performance not by adding HW, but by separating individual HW modules of a conventional uniprocessor (UP) architecture
- Instead of designing one composite, complex piece of HW, the architect for a pipelined machine designs a sequence of simpler and thus faster, consecutive modules
- Ideally all modules mi would be of similar complexity
- These separate modules mi are each significantly simpler than the original composite processor, and execute in an overlapped (pipelined) fashion, simultaneously progressing on more than one machine instruction at any one time
4
High-Level View of Pipelining
- Instead of executing one complex instruction in one long cycle, a pipelined architecture executes a sequence of multiple, faster sub-instructions
- Each sub-instruction is much simpler and thus much faster to execute than the single complex instruction
- Such single-cycle, pipelined sub-instructions are initiated once per -short- clock cycle
- Each instruction then progresses to completion while migrating through the various stages of the separate hardware modules mi, called the pipeline
5
High-Level View of Pipelining

Left: Traditional Hardware Architecture, one composite module containing:
  I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store

Right: equivalent Pipelined Architecture, separate consecutive stages:
  I-Fetch -> Decode -> O1-Fetch -> O2-Fetch -> ALU op -> R Store
6
Idealized Pipeline
- Arithmetic Logic Unit (ALU) split into separate, sequential modules
- Each module can be initiated once per (shorter, pipelined) clock cycle
- Each module mi exists in HW just once, like in a regular UP; note exceptions, when a module is used more than once by the original, non-pipelined instruction:
  - example: normalize operation in FP instruction
  - also FPA in FP multiply operation
- Multiple modules operate in parallel on different instructions, at different stages
7
Idealized Pipeline
- Ideally, all modules require unit time (1 cycle)! Ideal only!
- Ideally, all original, complex operations (fp-add, divide, fetch, store, increment by integer 1, etc.) require the same number n of steps to completion
- But they do not! E.g. FP divide takes far longer than, say, an integer increment, or a no-op
- Differing numbers of cycles per instruction cause different termination times
- Operations may abort in intermediate stages, e.g. in case of a pipeline hazard, caused by: branch, call, return, conditional branch, exception
- An operation also stalls in case of operand dependence
8
Idealized Pipeline

Instr  t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  t12  t13
i1     if   de   op1  op2  exec wb
i2          if   de   op1  op2  exec wb
i3               if   de   op1  op2  exec wb
i4                    if   de   op1  op2  exec wb
i5                         if   de   op1  op2  exec wb
i6                              if   de   op1  op2  exec wb
i7                                   if   de   op1  op2  exec wb
i8                                        if   de   op1  op2  exec wb
Retire                          i1   i2   i3   i4   i5   i6   i7   i8

From t6 on: 1 clock per new instruction retirement, CPI = 1
Horizontal: time, in units of cycles. Vertical: consecutive instructions
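The timing in the table follows directly from the fill (priming) time: a k-stage pipeline retires the first of n instructions after k cycles and one more each cycle thereafter. A small sketch of that arithmetic (not from the slides):

```python
# Total cycles for n instructions on an ideal k-stage pipeline:
# k cycles to prime the pipe, then one retirement per cycle.
def pipeline_cycles(k_stages, n_instructions):
    return k_stages + n_instructions - 1

# The 6-stage example above: i1 retires at t6, i8 at t13.
print(pipeline_cycles(6, 1))   # 6
print(pipeline_cycles(6, 8))   # 13
```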
9
Some Definitions
10
Definitions: Basic Block
- Sequence of instructions (one or more) with a single entry point and a single exit point; entry and exit w.r.t. transfer of control
- Entry point may be the destination of a branch, a fall-through from a conditional branch, or the program entry point, i.e. the destination of an OS jump
- Exit point may be an unconditional branch instruction, a call, a return, or a fall-through
- Fall-through means: one instruction is a conditional flow-of-control change, and the subsequent instruction is executed by default, if the change in control flow does not take place
- Or fall-through can mean: the successor of the exit point is a branch or call target
11
Definitions: Collision Vector
- Observation: an instruction requiring n cycles to completion may be initiated a second time n cycles after the first without possibility of conflict
- For each of the n-1 cycles before that, a further instruction of identical type causes a resource conflict, if initiated
- The Boolean vector of length n-1 that records, for each of these cycles, whether re-issue is possible is called the collision vector
- It can be derived from the Reservation Table
12
Definitions: Cycles Per Instruction (cpi)
- cpi quantifies how long it takes for a single instruction to execute
- Generally, the number of execution cycles per instruction is cpi > 1
- However, on a pipelined UP architecture, where a new instruction can be initiated at each cycle, it is conceivable to reach a cpi rate of 1, assuming no hazards
- Note the different meanings, i.e. durations, of "cycle" on the two architectures
- On a pipelined UP architecture the cpi rate cannot shrink below 1
- Yet on an MP architecture, or a superscalar machine that is pipelined, the rate may be cpi < 1
13
Definitions: Dependence
- If the logic of the underlying program imposes an order between two instructions, there exists dependence -data or control dependence- between them
- Generally, the order of execution then cannot be permuted
- It is conventional in CS to call this dependence, not dependency
14
Definitions: Early Pipelined Computers/Processors
1. CDC 6000 series of the late 1960s
2. CDC Cyber series of the 1970s
3. IBM 360/91 series
4. Intel® Pentium® 4 and Xeon™ processor families
15
Definitions: Flushing
- When a hazard occurs due to a change in flow of control, the partially executed instructions after the hazard are discarded
- This discarding is called flushing
- Antonym: priming
- Flushing is not needed in case of a stall caused by dependences; waiting instead will resolve this
16
Definitions: Hazard
- Instruction i+1 is pre-fetched under the assumption it would be executed after instruction i
- Yet after decoding i it becomes clear that i is a control-transfer operation
- Hence all subsequently pre-fetched instructions i+1 and on are wasted
- This is called a hazard
- A hazard causes part of the pipeline to be flushed, while a stall (caused by data dependence) also causes a delay, but a simple wait will resolve such a conflict
17
Definitions: ILP
- Instruction Level Parallelism: architectural attribute allowing multiple instructions to be executed at the same time
- Related: Superscalar
18
Definitions: Interlock
- If HW detects a conflict during execution of instructions i and j, with i initiated earlier, such a conflict, called a stall, delays execution of some instructions
- Interlock is the architecture's way to respond to and resolve a stall, at the expense of degraded performance
- Synonym: delay or wait
19
Definitions: IPC
- Instructions per cycle: a measure for Instruction Level Parallelism. How many different instructions are being executed -not necessarily to completion- during one single cycle?
- Desired is an IPC rate > 1, but ideally, given some parallelism, IPC >> 1
- On conventional UP CISC architectures it is typical to have IPC << 1
20
Definitions: Pipelining
- Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete
- Highly pipelined Xeon processors, for example, have a pipeline of more than 20 stages
21
Definitions: Prefetch (Instruction Prefetch)
- Bringing an instruction to the execution engine before it is reached by the instruction pointer (ip) is called instruction prefetch
- Generally this is done because some other knowledge exists suggesting that the instruction will be executed soon
22
Definitions: Priming
- Filling the various modules (stages) of a pipelined processor with different instructions, up to the point of retirement of the first instruction, is called priming
- Antonym: flushing
23
Definitions: Register Definition
- If an arithmetic or logical operation places its result into register ri, we say that ri is being defined
- Synonym: writing a register
- Antonym: register use
24
Definitions: Reservation Table
- Table that shows which hardware resource (module mi) is in use at which cycle of a multi-cycle instruction
- Typically an X written in the Reservation Table matrix indicates use
- An empty field indicates the corresponding resource is free during that cycle
25
Definitions: Retire
- When all parts of an instruction have successfully migrated through all execution stages, that instruction is complete
- Hence it can be discarded; this is called being retired
- All results have been posted
26
Definitions: Stall
- If instruction i requires operand o that is being computed by another instruction j, and j is not complete when i needs o, there exists dependence between the two instructions i and j; the wait thus created is called a stall
- A stall prevents the two instructions from being executed simultaneously, since instruction i must wait for the other to complete. See also: hazard, interlock
- A stall can also be caused by a HW resource conflict: some earlier instruction i may use HW resource m, while another instruction j needs m
- Generally j has to wait until i frees m, causing a stall
27
Goal & Obstacles to Pipelining
28
Goal of Pipelining
- Complete instructions at a rate higher than a single instruction's cycles-per-instruction count would allow
- Program completion time on a pipelined architecture is shorter than on a non-pipelined architecture, achieved by having the separate hardware modules progress on multiple instructions at the same time. The same hardware is being reused frequently
- Pipelined instructions must be retired in the original, sequential order, or in an order semantically equivalent
- Stalls and hazards should be minimized
- HW resolves dependence conflicts via interlocking
29
Causes for Dependences
- Load into register ri in one instruction, followed by use of that register ri: true dependence
- Load into register ri in one instruction, followed by use of any register (if hardware does not bother to check the register id; e.g. early HP PA)
- Definition of register ri in one instruction (other than a load), followed by use of register ri (on hardware designed with severe limitations)
- Store into memory followed by a load from memory; generally the memory subsystem does not know (i.e. does not check) whether the load comes from the same address as the earlier store; if it does, the long wait can be bypassed (note PCI and PCI-X protocols)
30
Basic Block: Find Dependences

-- result: is leftmost operand after opcode, except for st
-- other operands, if any, are instruction sources
-- Mxx is Memory address at xx, implies indirection for ld
-- The parens in (Mxx) render such indirection explicit
-- 8(sp) means indirect through sp register, offset by 8
-- #4 stands for literal value 4, decimal base

1  ld  r2, (M0)
2  add sp, r2, #12
3  st  r0, (M1)
4  ld  r3, -4(sp)
5  ld  r4, -8(sp)
6  add sp, sp, #4
7  st  r2, 0(sp)
8  ld  r5, (M2)
9  add r4, r0, #1
31
Basic Block: Find Dependences
- 1-2: load of a register followed by use of that register
- 3-4: store followed by load into any register
- 3-4: load from memory at -4(sp) while write to memory (M1) is in progress
- 3-5: load from memory at -8(sp) while write to memory (M1) is in progress
- 2-4, 2-5: definition of register sp, followed by use of the same register; distance sufficient to avoid a stall on typical architectures
- 6-7: definition of register sp right before use; forces sequential execution, reduces pipelining
- 7-8: store followed by load!
- 8-9: load into register r5 followed by use of any register (on few, simple architectures)
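The register dependences listed above can be found mechanically by tracking, per register, the line of its most recent definition. A small sketch, not from the slides; memory (store/load) dependences are ignored here:

```python
# Each instruction: (line number, destination registers, source registers).
# Encoding of the basic block above; stores define no register.
insns = [
    (1, ("r2",), ()),          # ld  r2, (M0)
    (2, ("sp",), ("r2",)),     # add sp, r2, #12
    (3, (), ("r0",)),          # st  r0, (M1)
    (4, ("r3",), ("sp",)),     # ld  r3, -4(sp)
    (5, ("r4",), ("sp",)),     # ld  r4, -8(sp)
    (6, ("sp",), ("sp",)),     # add sp, sp, #4
    (7, (), ("r2", "sp")),     # st  r2, 0(sp)
    (8, ("r5",), ()),          # ld  r5, (M2)
    (9, ("r4",), ("r0",)),     # add r4, r0, #1
]

def raw_dependences(insns):
    """True (read-after-write) register dependences as (def line, use line, reg)."""
    last_def, deps = {}, []
    for line, dests, srcs in insns:
        for r in srcs:
            if r in last_def:
                deps.append((last_def[r], line, r))
        for r in dests:               # record definitions after uses
            last_def[r] = line
    return deps

print(raw_dependences(insns))
# Includes 1-2 (r2), 2-4, 2-5, 2-6 (sp), and 6-7 (sp) from the list above.
```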
32
Stalls and Hazards
- A hardware interlock slows down execution due to the delay of the dependent instruction; the benefit is a correct result
- The programmer can re-arrange instructions or insert delays at selected places
- The compiler can re-schedule instructions, or insert delays (like the programmer)
- Unless the programmer's / compiler's efforts are provably complete, the hardware interlock must still be provided
33
Stalls and Hazards
- The CDC 6000 and IBM 360/91 already used automatic hardware interlocks
- It is advisable to have the compiler re-schedule the instruction sequence, since re-ordering may minimize the number of interlocks actually occurring
34
Stalls and Hazards
- Not all HW modules are used exactly once and for only one single cycle
- Some HW modules mi are used more than once in one instruction; e.g. the normalizer in floating-point operations
- Basic Block analysis is insufficient to detect all stalls or hazards; they may span separate Basic Blocks
- One can add delay circuits to HW modules, which slows down execution, but ensures a correct result
35
Instruction Reservation Tables & Collision Vectors
36
Reservation Tables in Pipelining
- Table 1 below, known as a reservation table, shows an ideal schedule using hardware modules m1 to m6, required for execution of one instruction
- Ideal: because each mi requires exactly 1 cycle, and each HW module is used exactly once
- "Always 1 cycle" is also unrealistic
- The time to complete 3 instructions i1, i2, and i3 is 8 cycles, while the time for any single instruction is 6 cycles; clearly a net saving over time:
- The completion time per instruction in the steady state is 1 cycle on the pipelined architecture
37
Reservation Tables in Pipelining
- For fairness' sake: on a non-pipelined architecture any one of these 3 instructions would NOT necessarily take the equivalent of 6 cycles of the pipelined machine; perhaps 4, or 5
- Also for fairness' sake: it is not usual that 3 instances of the same instruction are arranged one after the other
- That model is used here to explain Reservation Tables
38
Reservation Tables in Pipelining
- Key learning: a pipelined architecture does NOT speed up execution of a single instruction, and may even slow it down; but it improves throughput of multiple instructions in a row: only in the steady state

Table 1: Instructions i1 to i3 use 6 HW modules mi

      t1  t2  t3  t4  t5  t6  t7  t8
m6                        i1  i2  i3
m5                    i1  i2  i3
m4                i1  i2  i3
m3            i1  i2  i3
m2        i1  i2  i3
m1    i1  i2  i3
39
Reservation Tables in Pipelining
- Table 2 shows a more realistic schedule of HW modules required for a single instruction
- In this schedule some modules are used repeatedly; for example m3 is used 3 times in a row, and m6 4 times
- But all these cycles are contiguous; even that is not always a realistic constraint
- The schedule in Table 2 attempts the greedy approach for instruction i2: it is initiated as soon as possible, but we see that this doesn't help the time to completion
40
Reservation Tables in Pipelining
- We could have let instruction i2 start at cycle t4, so no additional delay would be caused by m3
- Or instruction i2 could start at t3, with just one additional delay due to m3
- However, in both cases m6 would cause a delay later anyway
- To schedule i2 we must consider these multi-cycle resources, m3 and m6, that are in use continuously
- In case of a load, the actual module would wait many more cycles, until the data arrive, and would not progress until then
41
Reservation Tables in Pipelining

Table 2: Instructions i1 to i3 use 6 HW modules mi for 1-4 cycles; d marks a delay

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
m6                                i1  i1  i1  i1  i2  i2  i2  i2  i3  i3  i3  i3
m5                            i1          i2  d2              i3
m4                        i1          i2                  i3
m3            i1  i1  i1  i2  i2  i2          i3  i3  i3
m2        i1  i2  d2  d2                      i3
m1    i1  i2  d   d   d   d   d   d   i3
42
Reservation Tables in Pipelining
- Instead of using the single resources m3 and m6 repeatedly and continuously, an architect can replicate them as many times as needed within a single instruction
- This costs more hardware, and does not speed up execution of a single instruction
- For a single operation, all modules would still have to progress in sequence
- But it avoids the delay of subsequent instructions needing the same type of HW module
- See Table 3: the duplicated hardware modules appear as the additional rows m3,2, m3,3 and m6,2 to m6,4
43
Reservation Tables in Pipelining

Table 3: Instructions i1 to i3 use replicated HW modules m3 and m6

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
m6,4                                           i1  i2  i3
m6,3                                       i1  i2  i3
m6,2                                   i1  i2  i3
m6                                 i1  i2  i3
m5                             i1  i2  i3
m4                         i1  i2  i3
m3,3                   i1  i2  i3
m3,2               i1  i2  i3
m3             i1  i2  i3
m2         i1  i2  i3
m1     i1  i2  i3
44
Reservation Tables in Pipelining
- These replicated circuits in Table 3 do not speed up the execution of any individual instruction
- But by avoiding the delay for other instructions, a higher degree of pipelining is enabled, and multiple instructions can retire earlier
- Even this is unrealistically simplistic
- Some of the modules are used for more than one cycle, but not necessarily contiguously
- Instead, a Reservation Table offers a more realistic representation
- Use the Reservation Table in Table 4 to figure out how closely identical instructions can be scheduled one after another
45
Collision Vector
- Best case: the next identical instruction can be scheduled at the next cycle
- Worst case: the next instruction can be scheduled only n cycles after the start of the first, if the first requires n cycles for completion
- The goal for a HW designer is to find how many instructions can be scheduled in between on a regular basis
- To analyze this for speed, we use the Reservation Table and Collision Vector (CV)
46
Collision Vector
- Goal: find the Collision Vector by overlapping two identical Reservation Tables (e.g. plastic transparencies) within the window of the cycles of one operation
- If, after shifting the second transparency i = 1..n-1 time steps, 2 resource marks of one row land in the same field, we have a collision: both instructions claim that resource at the same time!
- Collision means: mark field i in the CV with a 1
- Otherwise mark it with a 0, or leave it blank
- Do so n-1 times, and the CV is complete. But do check all rows, i.e. all HW modules mj
47
Collision Vector

Table 4: Reservation Table for 7-step, 5-module instruction

      t1  t2  t3  t4  t5  t6  t7
m1    X
m2        X   X
m3            X   X
m4                    X       X
m5                        X

Table 5: Find the Collision Vector for the above instruction, by overlaying two shifted copies of Table 4
The Collision Vector has n-1 entries for an n-cycle instruction
48
Collision Vector
- If a second instruction of the kind shown in Table 4 were initiated one cycle after the first, resource m2 would cause a conflict
- This is because instruction 2 requires m2 at cycles 3 and 4
- However, instruction 1 is already using m2 at cycles 2 and 3
- At step 3 there would be a conflict
- Also resource m3 would cause a conflict
- The good news, however, is that this double conflict causes no further entry in the CV
49
Collision Vector
- Similarly, a new instruction cannot be initiated 2 cycles after the start of the first
- This is because instruction 2 would require m4 at cycles t7 and t9
- However, instruction 1 is already using m4 at t5 and t7; at step t7 there would be a conflict
- At all other offsets a second instruction may be initiated. See the completed CV in Table 6

Table 6: Collision Vector for the above 7-cycle, 5-module instruction

1 1 0 0 0 0
50
Reservation Table: Main Example
- The next example is a hypothetical instruction of a hypothetical architecture, characterized by Reservation Table 7: 7 cycles, 4 modules
- It will be used throughout this section

Table 7: Reservation Table for 7-cycle, 4-module Main Example

      t1  t2  t3  t4  t5  t6  t7
m1    X       X   X
m2        X               X
m3            X   X
m4                    X
51
Reservation Table: Main Example
- The Collision Vector for the Main Example says: we can start a new instruction of the same kind at step t6 or t7
- Of course, we can always start a new instruction, identical or of another type, after the current one has completed; no resource will be in use then
- The challenge is to start another one while the current is still executing

Table 8: Collision Vector for Main Example

1 1 1 1 0 0

- The goal now is to show that by adding delays we can sometimes speed up execution of pipelined ops!
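Deriving a collision vector from a reservation table takes only a few lines. A sketch, not from the slides; the cycle placements below are one layout of the Main Example's marks consistent with its stated collision vector and schedules:

```python
# Reservation table: module -> set of busy cycles (1-based).
# One placement of the Main Example's X marks (assumed layout).
main_example = {"m1": {1, 3, 4}, "m2": {2, 6}, "m3": {3, 4}, "m4": {5}}

def collision_vector(table, n_cycles):
    """Entry k is 1 if a second identical instruction, initiated k cycles
    after the first, would claim some module in the same cycle."""
    cv = []
    for k in range(1, n_cycles):
        collide = any(t + k in busy for busy in table.values() for t in busy)
        cv.append(1 if collide else 0)
    return cv

print(collision_vector(main_example, 7))   # [1, 1, 1, 1, 0, 0], i.e. Table 8
```

The first 0 entry is at offset 5, which is why the earliest overlapped initiation in Table 9 is at step t6.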
52
Main Example Pipelined
- For the Main Example, initiate a second, pipelined instruction Y at step t6, 5 cycles after the start of X
- Greedy approach to pipeline X and Y as follows:
- Observe the two-cycle overlap. This is all the speed gain we can get. Starting Y earlier (greedy approach) would create delays, and not retire the second one any earlier

Table 9: Pipelining 2 Instructions of Main Example

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1    X       X   X       Y       Y   Y       Z
m2        X               X   Y               Y   Z
m3            X   X               Y   Y
m4                    X               Y
53
Main Example Pipelined
- The 3rd pipelined instruction Z can start at time step t11, by which time the first is retired, the second half-way through
- The fourth instruction A can start at step t16, etc.

Table 10: Pipelining 3 Instructions of Main Example

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17
m1    X       X   X       Y       Y   Y       Z       Z   Z       A
m2        X               X   Y               Y   Z               Z   A
m3            X   X               Y   Y           Z   Z
m4                    X               Y               Z
54
Main Example Pipelined
- Though Reservation Table 7 for the Main Example is sparsely populated, no high degree of pipelining is possible
- The maximum overlap is 2 cycles
- Can one already infer this low degree of pipelining from the Collision Vector?
- In pipelined execution there are 5 cycles per instruction retirement in the steady state: cpi = 5
- That means 5 cycles per completion of an instruction, assuming the same instruction is executed over and over, and only after the steady state has been reached, not during the priming phase!
- We'll come back to the Main Example and analyze it after further study of Examples 2 and 3
55
Pipeline Example 2
- The Reservation Table for Example 2 has 7 entries in 24 fields, 6 steps; density = 0.29166. The Main Example has 8 entries in 28 fields; density = 0.28571
- We'll attempt to pipeline as many identical Example 2 instructions as possible

Table 11: Reservation Table for Example 2

      t1  t2  t3  t4  t5  t6
m1    X                   X
m2        X           X
m3            X
m4                X   X

Table 12: Collision Vector for Example 2

1 0 1 0 1
56
Pipeline Example 2
- The Collision Vector suggests initiating a new pipelined instruction at time t3, t5, t7, etc.
- That allows 3 instructions simultaneously, overlapped, pipelined. By step t7 the first instruction would already be retired

Table 13: Schedule for Pipelining Example 2

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
m1    X       Y       Z   X   A   Y   B   Z       A       B
m2        X       Y   X   Z   Y   A   Z   B   A       B
m3            X       Y       Z       A       B
m4                X   X   Y   Y   Z   Z   A   A   B   B
57
Pipeline Example 2
- We were just lucky to be able to pipeline 3 identical instructions at the same time
- The CV is not a direct indicator. The reader was mildly misled to make inferences that don't strictly follow
- However, if all positions in the CV were marked 1, there would be no pipelining
- For Example 2 the number of cycles per instruction retirement is cpi = 2
- Even though the operation density is slightly higher than in the Main Example, the pipelining overlap is significantly higher, which is counter-intuitive! On to Example 3!
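A quick way to check such a schedule is to test every pair of initiation times against the reservation table. A sketch, not from the slides; the cycle placements are one layout of Table 11 consistent with the stated collision vector:

```python
# Example 2 reservation table: module -> busy cycles (1-based), assumed layout.
example2 = {"m1": {1, 6}, "m2": {2, 5}, "m3": {3}, "m4": {4, 5}}

def collides(table, offset):
    """Would two identical instructions, initiated `offset` cycles apart,
    claim some module in the same cycle?"""
    return any(t + offset in busy for busy in table.values() for t in busy)

def schedule_legal(table, starts):
    """A set of initiation times is legal iff no pair of starts collides."""
    return not any(collides(table, abs(a - b))
                   for i, a in enumerate(starts) for b in starts[i + 1:])

# Initiations every 2 cycles, as in Table 13: X, Y, Z, A, B
print(schedule_legal(example2, [1, 3, 5, 7, 9]))   # True -> cpi = 2
print(schedule_legal(example2, [1, 2]))            # False: CV entry 1 is 1
```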
58
Pipeline Example 3
- It is interesting to see, by analyzing the Collision Vector of Example 3, how much we can pipeline. The Reservation Table has numerous resource fields filled, yet the Collision Vector is sparser than the one in Example 2

Table 14: Reservation Table for Example 3

      t1  t2  t3  t4  t5  t6
m1    X       X       X
m2        X       X       X
m3    X       X       X
m4        X       X       X

Table 15: Collision Vector for Example 3

0 1 0 1 0
59
Pipeline Example 3
- The Collision Vector suggests starting a new pipelined instruction 1, 3, or 5 cycles after initiation of the first
- The CV is less packed with 1s than in the previous case, Example 2, where we could overlap 3 identical instructions and get a rate of cpi = 2. The goal is to find the best cpi rate for Example 3

Table 16: Schedule for Pipelining Example 3 (X1, Y2 denote the second instances of X and Y)

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1    X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m2        X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1
m3    X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m4        X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1
60
Pipeline Example 3
- Example 2 earlier, with Collision Vector 10101, allows a higher degree of pipelining
- Here in Example 3, cpi = 3: every 6 cycles two instructions can retire
- This is in contrast to cpi = 2 of Example 2
- The reason for the lower retirement rate is clear:
- All 4 HW modules are used every other cycle by one of the two instructions, thus one cannot overlap more than twofold
61
Vertical Expansion for Example 3
- If we need a higher degree of pipelining for Example 3, with its fill factor of 0.5, we must pay! Vertically with more hardware, or horizontally with more time for added delays
- Analyze the vertically expanded Reservation Table, now with 8 modules; every hardware resource m1 to m4 is replicated; density = 0.25

Table 17: Reservation Table Example 3 with Replicated HW

       t1  t2  t3  t4  t5  t6
m1     X               X
m2         X               X
m3             X
m4                 X
m1,2           X
m2,2               X
m3,2   X               X
m4,2       X               X
62
Vertical Expansion for Example 3
- Let us pipeline multiple identical instructions for Reservation Table 17 as densely as possible
- With twice the HW, can we overlap perhaps twice as much? The previous rate with half the hardware was cpi = 3. A solution is shown in Table 18

Table 18: Schedule for Pipelining the Expanded Example 3

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1     X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m2         X   Y   Z   A   X   Y   Z   A   X   Y   Z
m3             X   Y   Z   A                   X   Y
m4                 X   Y   Z   A                   X
m1,2           X   Y   Z   A                   X   Y
m2,2               X   Y   Z   A                   X
m3,2   X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m4,2       X   Y   Z   A   X   Y   Z   A   X   Y   Z
63
Vertical Expansion for Example 3
- The initiation rate and retirement rate are 4 instructions per 8 cycles: cpi = 2
- This is, as suspected, better than the rate of the original Example 3; not surprising with double the hardware modules
- But this is not twice as good a retirement rate. The original rate was cpi = 3; the modified rate with double the hardware is cpi = 2
- Our next case, a variation of the Main Example, shows an expansion of the Reservation Table horizontally
- I.e. delays are built in, but the HW modules are kept constant. Only the 4 modules m1 to m4 from the Main Example are provided
64
Horizontal Expansion, Main Example
- After Examples 2 and 3, we expand the Main Example, repeated below, by adding delays, AKA Horizontal Expansion
- If we insert delay cycles, clearly execution of a single instruction will slow down
- However, if this yields a sufficient increase in the degree of pipelining, it may still be a win
- Building circuits to delay an instruction is cheap
- We analyze this variation next:
65
Horizontal Expansion, Main Example

Table 19: Original Reservation Table for Main Example

      t1  t2  t3  t4  t5  t6  t7
m1    X       X   X
m2        X               X
m3            X   X
m4                    X

Inserting a delay cycle after t3; the delay becomes the new t4
66
Horizontal Expansion, Main Example
- We'll insert delays; but where? A methodical way to compute the optimal position and number of delays exists, but is not shown here
- Instead, we'll suggest a sample position for a single delay and analyze the performance impact
- Table 20 shows the delay inserted after cycle 3

Table 20: Reservation Table for Main Example with 1 Delay, at t4

      t1  t2  t3  t4  t5  t6  t7  t8
m1    X       X       X
m2        X                       X
m3            X       X
m4                        X

Table 21: Collision Vector for Main Example with 1 Delay

0 1 0 1 1 0 0
67
Horizontal Expansion, Main Example
- The greedy approach is to schedule instruction Y as soon as possible, i.e. at the first 0 entry in the CV
- This would lead us to initiate a second instruction Y at time step t2, one cycle after instruction X. Is it optimal?

Table 22: Schedule for Pipelined Instructions X, Y, Z, A, with Delay Slot, Using Greedy Approach

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16
m1    X   Y   X   Y   X   Y       Z   A   Z   A   Z   A
m2        X   Y               X   Y   Z   A               Z   A
m3            X   Y   X   Y               Z   A   Z   A
m4                        X   Y                       Z   A
68
Horizontal Expansion, Main Example
- Initiation and retirement rates are 2 instructions every 7 cycles, or cpi = 3.5; see the retirement of each instruction in Table 22
- This is already better than cpi = 5 for the original Main Example without the delay
- Hence we have shown that adding delays can speed up the throughput of pipelined instructions
- But can we do better?
- After all, we have only tried the first sample of a greedy approach!
69
Horizontal Expansion, Main Example
- In this experiment we start the second instruction at cycle t4, three cycles after the start of the first
- Which cpi shall we get? See Table 23

Table 23: Schedule for Pipelined Main Example, with delay. Initiation later than the first opportunity, resulting in better throughput. Message: start later, i.e. run slower, to finish faster

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18
m1    X       X   Y   X   Y   Z   Y   Z   X   Z   X   Y   X   Y   Z   Y   Z
m2        X           Y       X   Z       Y   X       Z   Y       X   Z
m3            X       X   Y       Y   Z       Z   X       X   Y       Y   Z
m4                        X           Y           Z           X           Y
70
Horizontal Expansion, Main Example
- In the more patient schedule of Table 23 we complete one identical instruction every 3 cycles in the steady state
- Purple cells indicate instruction retirement
- X retires at completion of t8, Y after t11, and Z after t14
- Then X again after t17
- Now cpi = 3 with the not-so-greedy approach
- Key learning: to speed up pipelined execution, one can sometimes enhance throughput by adding delay circuits, by replicating hardware, by postponing instruction initiation, or by a combination of the above
- The greedy approach is not necessarily optimal
- The collision vector only states when one cannot initiate a new instruction (value 1); a 0 value is not a solid hint for initiating a new instruction
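The greedy and the patient schedules can be compared mechanically by testing pairwise initiation offsets against the reservation table. A sketch, not from the slides; the cycle placements are one layout of the delayed Main Example (Table 20) consistent with its collision vector:

```python
# Delayed Main Example (Table 20): module -> busy cycles (1-based), assumed layout.
delayed = {"m1": {1, 3, 5}, "m2": {2, 7}, "m3": {3, 5}, "m4": {6}}

def collides(table, offset):
    """Would two identical instructions started `offset` cycles apart conflict?"""
    return any(t + offset in busy for busy in table.values() for t in busy)

def legal(table, starts):
    """No pair of initiation times may collide."""
    return not any(collides(table, abs(a - b))
                   for i, a in enumerate(starts) for b in starts[i + 1:])

greedy  = [1, 2, 8, 9, 15, 16]   # initiate at every first CV zero (Table 22)
patient = [1, 4, 7, 10, 13, 16]  # constant latency 3 (Table 23)

print(legal(delayed, greedy), legal(delayed, patient))  # True True
# Both are legal, but greedy averages 7 cycles per 2 instructions (cpi 3.5),
# while the patient schedule sustains one initiation every 3 cycles (cpi 3).
```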
71
IBM Measurements
Agerwala and Cocke 1987; see [1]
- Memory bandwidth: 1 word/cycle, to fetch 1 instruction/cycle from the I-cache
- 40% of instructions are memory-access (load/store)
- Those would all benefit from access to a D-cache
- Code characteristics, dynamic:
  - 25% of all instructions: loads
  - 15% of all instructions: stores
  - 40% of all instructions: ALU/RR
  - 20% of all instructions: branches
    - 1/3 unconditional
    - 1/3 conditional taken
    - 1/3 conditional not taken
72
How Can Pipelining Work?
- About 1 out of 4 or 5 instructions is a branch
- Branches include all transfer-of-control instructions; these are: call, return, unconditional and conditional branch, abort, exception and similar machine instructions
- If a processor pipeline is deeper than, say, 5 stages, there will almost always be a branch in the pipeline, rendering several of the prefetched operations useless
- Some processors (e.g. Intel Willamette, [6]) have 20 stages. For such a processor pipelining would constantly cause stalls
73
How Can Pipelining Work?
- The remedy is branch prediction
- If the processor knows dynamically from which address to fetch, instead of blindly assuming the subsequent code address pc+1, this avoids most pipeline flushes
- Luckily, branch prediction in the 2010s has become about 97% accurate, only rarely causing the need to re-prime the pipe
- Also, processors are no longer designed with a pipeline as deep as the Willamette's roughly 20 stages
- Here we see interesting interactions of several computer architecture principles: pipelining and branch prediction, one helping the other to become exceedingly advantageous
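The numbers above can be combined into a rough cost estimate. A sketch, not from the slides; the flush penalty is a hypothetical value of roughly the pipeline depth:

```python
# Rough effective-CPI estimate: each mispredicted branch flushes the pipe.
branch_fraction = 0.20    # ~1 in 5 instructions is a branch (slide 72)
accuracy        = 0.97    # branch prediction accuracy (slide 73)
flush_penalty   = 20      # hypothetical: cycles lost per flush, ~pipe depth

effective_cpi = 1.0 + branch_fraction * (1 - accuracy) * flush_penalty
print(round(effective_cpi, 2))   # 1.12
```

Even a deep pipeline thus stays close to the ideal cpi = 1 once mispredictions are rare, which is why accurate prediction makes deep pipelining viable at all.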
74
Summary
- Pipelining can speed up execution
- Not due to the faster clock rate alone: that fast clock rate applies to significantly simpler sub-instructions and cannot be equated with the original, i.e. non-pipelined, clock
- Pipelining may even benefit from inserting delays
- It may also benefit from initiating an instruction later than possible
- And it benefits, not surprisingly, from added HW resources
- Branch prediction is a necessary architectural attribute to make pipelining work
75
Bibliography

1. Cocke and Schwartz, Programming Languages and their Compilers, unpublished, 1969, http://portal.acm.org/citation.cfm?id=1097042
2. Harold Stone, High Performance Computer Architecture, Addison-Wesley, 1993
3. cpi rate: http://en.wikipedia.org/wiki/Cycles_per_instruction
4. Introduction to PCI: http://electrofriends.com/articles/computer-science/protocol/introduction-to-pci-protocol/
5. Wiki PCI page: http://en.wikipedia.org/wiki/Conventional_PCI
6. NetBurst microarchitecture: http://en.wikipedia.org/wiki/NetBurst_(microarchitecture)