CSC 4250 Computer Architectures November 14, 2006 Chapter 4. Instruction-Level Parallelism & Software Approaches



Page 1: CSC 4250 Computer Architectures

CSC 4250Computer Architectures

November 14, 2006

Chapter 4. Instruction-Level Parallelism

& Software Approaches

Page 2: CSC 4250 Computer Architectures

Fig. 4.1. Latencies of FP ops in Chap. 4

The last column shows the number of intervening clock cycles needed to avoid a stall

The latency of a FP load to a FP store is zero, since the result of the load can be bypassed without stalling the store

Continue to assume an integer load latency of 1 and an integer ALU operation latency of 0

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
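The table can be read as a lookup from a producer/consumer pair to the required separation. A minimal Python sketch of that reading (the dictionary keys and function name are illustrative, not from the text):

```python
# A toy model of Fig. 4.1, assuming the latencies shown on the slide.
# The table gives the intervening cycles needed to avoid a stall.
LATENCY = {
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "Store double"): 2,
    ("Load double", "FP ALU op"): 1,
    ("Load double", "Store double"): 0,
}

def stall_cycles(producer, consumer, independent_between=0):
    """Stalls incurred when `independent_between` unrelated instructions
    separate the producing and consuming instructions."""
    return max(0, LATENCY[(producer, consumer)] - independent_between)
```

For example, an FP ALU op consuming a load's result one cycle later stalls once, matching the unscheduled loop on the next slides.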

Page 3: CSC 4250 Computer Architectures

Loop Unrolling

for (i = 1000; i > 0; i = i − 1)

x[i] = x[i] + s;

The above loop is parallel because the body of each iteration is independent.

MIPS code:

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#−8
      BNE    R1,R2,Loop
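For reference, a Python rendering of what the MIPS loop computes (the function name and the downward index walk are illustrative, not from the text):

```python
# A Python rendering of the loop the MIPS code implements: add the
# scalar s to every element of x. Each iteration touches a different
# element, which is what makes the loop parallel.
def add_scalar(x, s):
    for i in range(len(x) - 1, -1, -1):  # walk downward, like i = 1000..1
        x[i] = x[i] + s
    return x
```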

Page 4: CSC 4250 Computer Architectures

Example (p. 305)

Without any pipeline scheduling, the loop executes as follows:

                            Clock cycle issued
Loop: L.D    F0,0(R1)       1
      stall                 2
      ADD.D  F4,F0,F2       3
      stall                 4
      stall                 5
      S.D    F4,0(R1)       6
      DADDUI R1,R1,#−8      7
      stall                 8
      BNE    R1,R2,Loop     9
      stall                 10

Overhead = (10−3)/10 = 0.7; 10 cycles per result. How do we reduce the stalls to 1 clock cycle?

Page 5: CSC 4250 Computer Architectures

Example (p. 306)

With some pipeline scheduling, the loop executes as follows:

                            Clock cycle issued
Loop: L.D    F0,0(R1)       1
      DADDUI R1,R1,#−8      2
      ADD.D  F4,F0,F2       3
      stall                 4
      BNE    R1,R2,Loop     5
      S.D    F4,8(R1)       6

Overhead = (6−3)/6 = 0.5; 6 cycles per result. To schedule the delayed branch, the compiler has to determine that it can swap DADDUI and S.D by changing the address to which the S.D stores. The change is not trivial: most compilers would see that S.D depends on DADDUI and would refuse to interchange the two instructions.
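The offset change from 0(R1) to 8(R1) can be sanity-checked in a few lines of Python (the base address 1000 is an arbitrary example, not from the text):

```python
# Why S.D's offset changes from 0 to 8 when it moves below the DADDUI:
# after R1 is decremented by 8, offset 8 names the same byte address
# that offset 0 named before.
def effective_address(base, offset):
    return base + offset

old_r1 = 1000            # arbitrary example value of R1
new_r1 = old_r1 - 8      # after DADDUI R1,R1,#-8
assert effective_address(old_r1, 0) == effective_address(new_r1, 8)
```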

Page 6: CSC 4250 Computer Architectures

Loop Unrolled Four Times ─ Registers not reused

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,−8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,−8(R1)
      L.D    F10,−16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,−16(R1)
      L.D    F14,−24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,−24(R1)
      DADDUI R1,R1,#−32
      BNE    R1,R2,Loop

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been adjusted. This loop runs in 28 cycles ─ each L.D has 1 stall, each ADD.D 2, the DADDUI 1, the branch 1, plus 14 instruction issue cycles. Overhead = (28−12)/28 = 4/7 = 0.57; 7 (=28/4) cycles per result.

Page 7: CSC 4250 Computer Architectures

Upper Bound on Loop (p. 307)

In real programs, we do not know the upper bound of the loop; call it n. Say we want to unroll the loop k times. Instead of one single unrolled loop, we generate a pair of consecutive loops. The first loop executes (n mod k) times and has a body that is the original loop. The second loop is the unrolled body surrounded by an outer loop that iterates (n/k) times.
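The two-loop scheme can be sketched in Python as follows, assuming the simple x[i] += s body; the function name and structure are illustrative, not from the text:

```python
# Strip-mining sketch: a prologue loop runs the original body (n mod k)
# times, then an outer loop runs the k-times-unrolled body n // k times.
def strip_mined_add(x, s, k):
    n = len(x)
    # First loop: the original body, executed (n mod k) times.
    for i in range(n % k):
        x[i] += s
    # Second loop: the k-times-unrolled body, iterated n // k times.
    i = n % k
    for _ in range(n // k):
        for u in range(k):   # stands in for k literal copies of the body
            x[i + u] += s
        i += k
    return x
```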

Page 8: CSC 4250 Computer Architectures

Schedule Unrolled Loop

Loop: L.D    F0,0(R1)
      L.D    F6,−8(R1)
      L.D    F10,−16(R1)
      L.D    F14,−24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,−8(R1)
      DADDUI R1,R1,#−32
      S.D    F12,16(R1)
      BNE    R1,R2,Loop
      S.D    F16,8(R1)

This loop runs in 14 cycles ─ there is no stall. Overhead = 2/14 = 1/7 = 0.14; 3.5 (=14/4) cycles per result. We need to know that the loads and stores are independent and can be interchanged.

Page 9: CSC 4250 Computer Architectures

Loop Unrolling and Scheduling Example

Determine that it is legal to move the S.D after the DADDUI and BNE, and find the amount to adjust the S.D offset.

Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code.

Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.

Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.

Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.

Schedule the code, preserving any dependences needed to yield the same result as the original code.

Page 10: CSC 4250 Computer Architectures

Three Limits to Gains by Loop Unrolling

1. Decrease in the amount of overhead amortized with each unroll. In our example, when we unroll the loop four times, it generates sufficient parallelism among the instructions that the loop can be scheduled with no stall cycles. In 14 clock cycles, only 2 cycles are loop overhead. If the loop is unrolled 8 times, the overhead is reduced from ½ per original iteration to ¼.

2. Code size limitations. For larger loops, the code size growth may become a concern in either the embedded processor space where memory is at a premium or if the larger code size causes a decrease in the instruction cache hit rate.

3. Register Pressure. Scheduling code to increase ILP causes the number of live values to increase. After aggressive instruction scheduling, it may not be possible to allocate all live values to registers.
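Limit 1 can be put in numbers with a one-line Python helper, assuming roughly 2 cycles of loop maintenance (the DADDUI and BNE) per trip through the unrolled loop:

```python
# Loop-maintenance overhead (~2 cycles for DADDUI + BNE) amortized
# over k original iterations when the loop is unrolled k times.
def overhead_per_iteration(k, maintenance_cycles=2):
    return maintenance_cycles / k
```

Unrolling 4 times gives ½ cycle of overhead per original iteration; unrolling 8 times gives ¼, matching the slide.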

Page 11: CSC 4250 Computer Architectures

Schedule Unrolled Loop with Dual Issue

To schedule the loop with no delays, we unroll the loop five times. This yields 2.4 (=12/5) cycles per result. There are not enough FP instructions to keep the FP pipeline full.

Integer instruction        FP instruction        Clock cycle

Loop: L.D    F0,0(R1)                            1
      L.D    F6,−8(R1)                           2
      L.D    F10,−16(R1)   ADD.D F4,F0,F2        3
      L.D    F14,−24(R1)   ADD.D F8,F6,F2        4
      L.D    F18,−32(R1)   ADD.D F12,F10,F2      5
      S.D    F4,0(R1)      ADD.D F16,F14,F2      6
      S.D    F8,−8(R1)     ADD.D F20,F18,F2      7
      S.D    F12,−16(R1)                         8
      DADDUI R1,R1,#−40                          9
      S.D    F16,16(R1)                          10
      BNE    R1,R2,Loop                          11
      S.D    F20,8(R1)                           12

Page 12: CSC 4250 Computer Architectures

Static Branch Prediction

Static branch prediction is used in processors where we expect branch behavior to be highly predictable at compile time.

Delayed branches support static branch prediction. They expose a pipeline hazard so that the compiler can reduce the penalty associated with the hazard. The effectiveness depends on whether we can correctly guess which way a branch will go.

The ability to accurately predict a branch at compile time is helpful for scheduling data hazards. Loop unrolling is one such example. Another example arises from conditional selection branches (next four slides).

Page 13: CSC 4250 Computer Architectures

Conditional Selection Branches (1)

LD R1,0(R2)

DSUBU R1,R1,R3

BEQZ R1,L

OR R4,R5,R6

DADDU R10,R4,R3

L: DADDU R7,R8,R9

The dependence of DSUBU and BEQZ on LD means that a stall will be needed after LD. Suppose we know that the branch is almost always taken and that the value of R7 is not needed on the fall-through path. What should we do?

Page 14: CSC 4250 Computer Architectures

Conditional Selection Branches (2)

LD    R1,0(R2)
DADDU R7,R8,R9
DSUBU R1,R1,R3
BEQZ  R1,L
OR    R4,R5,R6
DADDU R10,R4,R3

L: …

We could increase the speed of execution by moving “DADDU R7,R8,R9” to just after LD

Suppose we know that the branch is rarely taken and that the value of R4 is not needed on the taken path. What should we do?

Page 15: CSC 4250 Computer Architectures

Conditional Selection Branches (3)

LD R1,0(R2)

OR R4,R5,R6

DSUBU R1,R1,R3

BEQZ R1,L

DADDU R10,R4,R3

L: DADDU R7,R8,R9

We could increase the speed of execution by moving “OR R4,R5,R6” to just after LD

See also “scheduling the branch delay slot” in Fig. A.14.

Page 16: CSC 4250 Computer Architectures

Conditional Selection Branches (4)

Page 17: CSC 4250 Computer Architectures

Branch Prediction at Compile Time

Simplest scheme: predict every branch as taken. The average misprediction rate for the SPEC programs is 34%, ranging from not very accurate (59%) to highly accurate (9%).

Predict on branch direction, choosing backward-going branches as taken and forward-going branches as not taken. This strategy works for many programs. However, for SPEC, more than half of the forward-going branches are taken, and thus it is better to predict all branches as taken.

A more accurate technique is to predict branches on the basis of profile information collected from earlier runs. The key observation is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken.
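The bimodal observation suggests a minimal Python sketch of profile-based prediction: record a branch's outcomes in a profiling run, then always predict the majority direction. The 'T'/'N' encoding and function names are illustrative, not from the text:

```python
from collections import Counter

# Profile-based static prediction: pick the direction a branch took most
# often during a profiling run, then apply that fixed prediction later.
# Outcomes are encoded 'T' (taken) / 'N' (not taken).
def profile_prediction(training_outcomes):
    return Counter(training_outcomes).most_common(1)[0][0]

def misprediction_rate(outcomes, prediction):
    return sum(o != prediction for o in outcomes) / len(outcomes)
```

A branch that was taken 90% of the time in the profiling run is predicted taken ever after; because most branches are highly biased, this fixed prediction mispredicts rarely.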

Page 18: CSC 4250 Computer Architectures

Misprediction Rate for Profile-based Predictor

Figure 4.3. The misprediction rate on SPEC92 varies widely but is generally better for the FP programs, with an average misprediction rate of 9% and a standard deviation of 4%, than for the integer programs, with an average misprediction rate of 15% and a standard deviation of 5%.

Page 19: CSC 4250 Computer Architectures

Comparison of Predicted-taken and Profile-based Strategies

Figure 4.4. The figure compares the accuracy of a predicted-taken strategy and a profile-based predictor for SPEC92 benchmarks, measured by the number of instructions executed between mispredicted branches on a log scale. The average number is 20 for predicted-taken and 110 for profile-based. The difference between the integer and FP benchmarks as groups is large. The corresponding distances are 10 and 30 (for integer) and 46 and 173 (for FP).

Page 20: CSC 4250 Computer Architectures

Compiler to Format the Instructions

Superscalar processors decide on the fly how many instructions to issue. A statically scheduled superscalar must check for any dependences between instructions in the issue packet as well as between any issue candidate and any instruction already in the pipeline. A statically scheduled superscalar requires significant compiler assistance to achieve good performance. In contrast, a dynamically scheduled superscalar requires less compiler assistance but has significant hardware costs.

An alternative is to rely on compiler technology to format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences. The compiler may be required to ensure that dependences within the issue packet cannot be present. Such an approach offers the potential advantage of simpler hardware while still exhibiting good performance through extensive compiler optimization.
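One way to picture the guarantee the compiler must provide is a toy dependence check over an issue packet, where each operation is a (destinations, sources) pair of register-name sets. This encoding is illustrative, not from the text:

```python
# Toy check that no RAW/WAW/WAR dependences exist within an issue
# packet, i.e. that the hardware could issue it without cross-checks.
# Each operation is a pair (dests, srcs) of register-name sets.
def packet_is_independent(packet):
    for i, (d_i, s_i) in enumerate(packet):
        for d_j, s_j in packet[i + 1:]:
            if d_i & s_j:   # RAW: a later op reads an earlier result
                return False
            if d_i & d_j:   # WAW: two ops write the same register
                return False
            if s_i & d_j:   # WAR: a later op overwrites a source
                return False
    return True
```

For instance, the two independent ADD.Ds from the unrolled loop pass the check, while an ADD.D feeding a second ADD.D in the same packet fails it.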

Page 21: CSC 4250 Computer Architectures

VLIW Architecture

It is a multiple-issue processor that organizes the instruction stream explicitly to avoid dependences, using wide instructions with multiple operations per instruction. The architecture is named VLIW (very long instruction word): since each instruction contains several operations, the instructions are very wide (64 to 128 bits, or more). Early VLIWs were quite rigid in their instruction formats and required recompilation of programs for different versions of the hardware.

A VLIW uses multiple, independent functional units. It packages the multiple operations into one very long instruction. For example, the instruction may contain five operations, including one integer operation (which could also be a branch), two FP operations, and two memory references.

Page 22: CSC 4250 Computer Architectures

How to Keep the Functional Units Busy

There must be sufficient parallelism in a code sequence to fill the available operation slots. The parallelism is uncovered by unrolling loops and scheduling the code within the single larger loop body. If the unrolling generates straight-line code, then local scheduling techniques, which operate on a single basic block, can be used.

If finding and exploiting the parallelism requires scheduling across the branches, a more complex global scheduling algorithm must be used. We will discuss trace scheduling, one global scheduling technique developed specifically for VLIWs.

Page 23: CSC 4250 Computer Architectures

Example of Straight-line Code Sequence

VLIW: 2 memory references, 2 FP operations, and 1 integer or branch instruction per clock cycle. Loop body: x[i] = x[i] + s. Unroll as many times as necessary to eliminate stalls ─ seven times. 1.29 (=9/7) cycles per result.

Memory reference 1    Memory reference 2    FP operation 1      FP operation 2      Integer op / branch

L.D F0,0(R1)          L.D F6,−8(R1)
L.D F10,−16(R1)       L.D F14,−24(R1)
L.D F18,−32(R1)       L.D F22,−40(R1)       ADD.D F4,F0,F2      ADD.D F8,F6,F2
L.D F26,−48(R1)                             ADD.D F12,F10,F2    ADD.D F16,F14,F2
                                            ADD.D F20,F18,F2    ADD.D F24,F22,F2
S.D F4,0(R1)          S.D F8,−8(R1)         ADD.D F28,F26,F2
S.D F12,−16(R1)       S.D F16,−24(R1)                                               DADDUI R1,R1,#−56
S.D F20,24(R1)        S.D F24,16(R1)
S.D F28,8(R1)                                                                       BNE R1,R2,Loop