
Page 1: Software Exploits for ILP

Software Exploits for ILP
• We have already looked at compiler scheduling to support ILP
  – Altering code to reduce stalls
  – Loop unrolling and scheduling
  – Compiler-based scheduling for superscalars
  – VLIW
• Here, we examine Appendix H for additional compiler-based ideas, and the hardware that helps support some of these ideas
  – Few architectures have focused on static approaches beyond minimal support for compiler scheduling
  – However, the EPIC architecture relied heavily on them; here we will view EPIC and consider to what extent it succeeds (or fails) over dynamic approaches

Page 2: Software Exploits for ILP

Loop Dependencies
• To support loop unrolling, the compiler must detect any dependencies that exist both within and between loop iterations
  – Within a loop iteration, these are the typical RAW, WAW and WAR hazards
  – Between loop iterations, RAW, WAW and WAR hazards may be hard to identify because array indices may not match exactly
  – Consider the following two loop bodies, both iterating over i from 1 to 100
    • x[i] = x[i] + s;
    • x[i] = x[i-1] + s;
  – In the first, the RAW hazard will not cause any stalling behavior, nor will it be complicated by loop unrolling, because reads happen before writes
  – But in the second, the RAW hazard exists across loop iterations, so an unrolled loop could lead to problems

Page 3: Software Exploits for ILP

Example Examined
• Consider this code:
  – x[0] = …;
  – for(i=1;i<=100;i++) x[i] = x[i-1] + s;
• Assume x is an FP array so that the additions take 4 cycles
  – Let's unroll this loop to contain 4 iterations per loop; this gives us the following four assignments in the first unrolled loop iteration:
    – x[1] = x[0] + s;
    – x[2] = x[1] + s;
    – x[3] = x[2] + s;
    – x[4] = x[3] + s;
• If we schedule the above code, we would first attempt to L.D x[0], x[1], x[2], x[3], then do the ADD.Ds and finally the S.Ds
  – But each S.D is needed before the next ADD.D
• If the compiler doesn't detect this dependence, the scheduled code will be incorrect!

Page 4: Software Exploits for ILP

Forms of Dependencies
• There are 3 forms of dependencies
  – True or data dependencies – these are the same as RAW hazards as found in pipelining
    • we have to make sure that the value is written before we can subsequently read/use it
  – Name dependencies – these arise because the same named entity is referenced, but the data differs
    • for instance, we put a result in R1 and use it in a later instruction, but yet another instruction places a completely unrelated datum in R1
  – There are two forms of name dependencies
    • output dependencies – these arise when two instructions write two independent results to a named location without an intervening read; these are WAW hazards
    • antidependencies – these arise when a read and write must occur in the proper order so that the read takes place before the write; these are WAR hazards

Page 5: Software Exploits for ILP

Example
• Find the dependencies
  – both within a single loop iteration and across loop iterations
  – identifying each type
  – is the loop parallelizable (unrollable)?

  Original:                          Renamed:
  for(i=1;i<=100;i=i+1) {            for(i=1;i<=100;i=i+1) {
    y[i] = x[i] / c;  /* S1 */         t[i]  = x[i] / c;
    x[i] = x[i] + c;  /* S2 */         x1[i] = x[i] + c;
    z[i] = y[i] + c;  /* S3 */         z[i]  = t[i] + c;
    y[i] = c - y[i];  /* S4 */         y[i]  = c - t[i];
  }                                  }

  True: from S1 to S3 (y), from S1 to S4 (y)
  Anti: from S1 to S2 (x), from S3 to S4 (y)
  Output: from S1 to S4 (y)

As is, the loop is not parallelizable, but if we use renaming on x (S2) and y (S1, S3 and S4), we can unroll and schedule the code – notice that we have renamed x to x1 for this to work; later code would have to reference x1

Page 6: Software Exploits for ILP

Example
• The previous example had no loop carried dependencies
  – These can be tricky to find
    • just because an array is indexed by something other than i does not mean that there is a loop carried dependence
    • we will examine how to prove a loop carried dependence exists in a couple of slides; consider this loop:
  – for(i=1;i<=100;i++) {
      a[i+1] = a[i] + c[i];   // S1
      b[i+1] = b[i] + a[i+1]; // S2
    }
  – This code has the following true dependencies
    • a from S1 to S2
    • a from S1 to S1 (loop carried)
    • b from S2 to S2 (loop carried)
  – The loop carried dependencies, at least in this case, prevent the loop from being parallelizable

Page 7: Software Exploits for ILP

Example
• Not all loop carried dependencies prevent a loop from being parallelizable; consider this example
  – for(i=1;i<=100;i++) {
      a[i] = a[i] + b[i];   // S1
      b[i+1] = c[i] + d[i]; // S2
    }
    • here, we have a loop carried dependence on b from S2 to S1 (the dependence from a to a in S1 is not loop carried)
  – To parallelize this loop, we must eliminate the dependence
    • this change requires adding an initial S1 before the loop and a final S2 after the loop
  – a[1] = a[1] + b[1]; // initial S1
  – for(i=1;i<=99;i++) {
      b[i+1] = c[i] + d[i];     // S2
      a[i+1] = a[i+1] + b[i+1]; // S1
    }
  – b[101] = c[100] + d[100]; // final S2

Page 8: Software Exploits for ILP

Recurrences
• The key to identifying whether a loop carries a dependence across iterations is to find whether a recurrence of loop indices arises
  – A recurrence is when a loop index for a given variable is reused in another iteration
  – With a[i] and a[i+1], the recurrence is easy to detect, but consider these two statements:
    • a[i] = b[2*i] + c;   // S1
    • b[2*i+1] = d[i] * e; // S2
  – There is no recurrence of b between S1 and S2 because the index in S1 is always even and the index in S2 is always odd
• Identifying a recurrence can be computationally challenging; there are a number of tests we can apply
  – A test can prove that no dependence exists, but if it cannot rule one out, it does not tell us for sure that a dependence exists – the compiler must either conservatively assume a dependence or try another, more precise test

Page 9: Software Exploits for ILP

GCD Test
• One easy test applies when array indices are affine
  – Basically, an affine index is one that can be written in the form a*i + b, where i is the loop index and a and b are integer constants
  – The GCD test: suppose that within the bounds of the loop there are two iteration indices j and k, the loop stores into an array element indexed by a*j + b, and it later fetches from the same array at c*k + d
    • if a loop carried dependence exists, then the greatest common divisor of a and c must evenly divide d – b
    • so if GCD(a,c) does not divide d – b, no dependence exists; if it does divide, a dependence may exist (the test alone cannot guarantee it)
  – Examples:
    • x[2*i+3] = x[2*i]: GCD(2,2) = 2 does not divide -3, so no loop carried dependence exists
    • y[5*i-4] = y[15*i+6]: GCD(5,15) = 5 does divide 10 (6 - -4), so a dependence may exist – and here one does (for instance, iteration i = 2 fetches y[36] and iteration i = 8 stores into it)

Page 10: Software Exploits for ILP

Dependence Challenges
• There are a number of challenges that complicate loop dependence analysis
  – When storage is referenced by pointer rather than array index
  – When array indexing is indirect through another array
  – When dependencies exist for a subset of inputs but do not arise under other sets of inputs
• Although pointers pose a very difficult problem for static analysis (since pointers take on their values at run time), some forms of analysis are available; two pointers cannot create a dependence if
  – they cannot point to the same type
  – the object referenced by one pointer is only allocated under conditions that differ from those of the other
  – one pointer can only point to a local referent while the other points only to a global

Page 11: Software Exploits for ILP

Eliminating Computation
• Aside from loop unrolling/scheduling, another useful pursuit for the compiler is to replace repeated computations by storing the first result in a register
  – This can take on multiple forms
    • DADDI R1, R2, #4
    • DADDI R1, R1, #4
    • becomes
    • DADDI R1, R2, #8
  – We can take advantage of associativity, as in this example:

      DADD R1, R2, R3
      DADD R4, R1, R6
      DADD R8, R4, R7

    Here, we have two RAW hazards that might cause stalls; we replace them with the code below

      DADD R1, R2, R3
      DADD R4, R6, R7
      DADD R8, R1, R4

  – Additionally, if a particular computation, say c + d, is used in several locations, place c + d into a register and replace all further uses of the computation with the register

Page 12: Software Exploits for ILP

Software Pipelining
• As we saw earlier, a compiler can rearrange code in a loop to remove loop carried dependencies
  – The compiler can also rearrange the code to hide true dependencies found within an iteration, through a technique called symbolic loop unrolling
  – The idea is to identify, in each iteration of a loop, the instruction that can be paired with a previous and successive loop iteration
  – For instance, if one iteration performs an FP add which takes multiple cycles, the store for that add can be moved to the next iteration
    • to prevent having to use multiple groups of registers, we would place the store before the add of that iteration, since we would be storing the sum from the previous iteration
    • this may require adding "startup" and "cleanup" code

Page 13: Software Exploits for ILP

Continued
• Pictorially, the concept works as follows:
  – In iteration 0 we select the last instruction in the loop that has the dependence
  – In iteration 1 we select the second to last instruction, etc
  – In the last iteration we select the first instruction in the loop that has the dependence
• We add startup code so that the instructions preceding the last instruction from iteration 0 are still performed
• We add cleanup code so that the instructions from the last iteration that follow the first instruction are performed

Page 14: Software Exploits for ILP

Example
• Original loop:

  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        DSUBI  R1,R1,#8
        BNE    R1,R2,Loop

• Three consecutive iterations; the software-pipelined selections (bold-faced in the original slide) are the S.D of iteration i, the ADD.D of iteration i+1 and the L.D of iteration i+2:

  Iteration i:   L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
  Iteration i+1: L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
  Iteration i+2: L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)

• Software-pipelined version (R1 is pre-adjusted so the first element is at 16(R1)):

  Startup:  L.D    F0,16(R1)     ; load first element
            ADD.D  F4,F0,F2      ; add for first element
            L.D    F0,8(R1)      ; load second element
  Loop:     S.D    F4,16(R1)     ; store sum from two iterations back
            ADD.D  F4,F0,F2      ; add for the previous iteration's load
            L.D    F0,0(R1)      ; load for the current iteration
            DSUBI  R1,R1,#8
            BNE    R1,R2,Loop
  Cleanup:  ADD.D  F8,F0,F2      ; add for the last element
            S.D    F4,8(R1)      ; store second-to-last sum
            S.D    F8,0(R1)      ; store last sum

Page 15: Software Exploits for ILP

Code Scheduling
• So far, our code scheduling has been limited to
  – Moving code within a basic block to fill stalls
  – Loop unrolling and scheduling
• What about moving code across conditional branches?
  – With branch history, we can make predictions on whether a branch will be taken or not
  – Are the benefits of moving code to avoid branch delays worth the risk of guessing wrong?
• Branch speculation requires several supporting mechanisms
  – A buffer to consult that provides the branch prediction and branch target location
  – A mechanism for "killing off" the speculated operation(s) if the prediction is wrong
  – A mechanism to ensure that speculated code does not raise an exception unless/until the speculation is proved correct

Page 16: Software Exploits for ILP

Example
• Consider the skeleton of an if-else statement: a condition, a then-clause B(i), an else-clause X, and following code C(i); we have some options for code scheduling
  – Move B(i) (the then-clause) before the condition
    • only useful if the condition is usually true and executing it will not impact X or C(i)
  – Move X (the else-clause) before the condition
    • only useful if the condition is usually false and executing it will not impact B(i) or C(i)
  – Move C(i) (the following code) before the condition, or into one of the clauses
    • only useful if executing it does not impact the condition, B(i) or X

Page 17: Software Exploits for ILP

Variants
• Move B(i) up before the condition
  – In X, reset B(i)
    • that is, if the condition turns out to be false, wipe out the B(i) assignment statement (reset it)
• Move C(i) up before the condition
  – Doable if we can ensure that neither the condition, B(i) nor X would be impacted
    • if we assume that X would be impacted and we predict the else clause is rarely taken, we could reset C(i) before doing X in the else clause
• The question comes down to
  – Is the benefit from a correct prediction more than the cost when incorrect?
  – Again, the compiler has to ensure that the movement does not cause an incorrect condition result or incorrect values from the if clause and else clause

Page 18: Software Exploits for ILP

Example
• Let's use the code if(x > y) x++; else y--;
• Further, let's assume that x > y is true 90% of the time
• The original code is shown below
  – if true, 4 instructions are executed; if false, 3
  – assume no stalls and a CPI of 1 per instruction
  – average cycles per execution = 4 * .9 + 3 * .1 = 3.9

        SGT    R3, R1, R2
        BEQZ   R3, else
        DADDI  R1, R1, #1
        J      next
  else: DSUBI  R2, R2, #1
  next: …

• Given the prediction (x > y), the compiler generates the speculated code below
  – if true, 3 instructions are executed; if false, 5
  – average cycles per execution = 3 * .9 + 5 * .1 = 3.2
  – a speedup of 3.9 / 3.2 = 1.22 (22%)

        SGT    R3, R1, R2
        DADDI  R1, R1, #1
        BNEZ   R3, next
        DSUBI  R1, R1, #1
        DSUBI  R2, R2, #1
  next: …

Page 19: Software Exploits for ILP

Trace Scheduling
• In the previous example, we selected the "critical path" – the most common path through the selection statement
  – Typically, such a conditional is found inside of a loop
• In trace scheduling, we combine selecting the critical path with loop unrolling so that we move the critical path out of the selection in multiple iterations
  – In order to handle a misprediction, we have exits out of the unrolled code and entrances to re-enter after handling the misprediction

Page 20: Software Exploits for ILP

Superblocks
• The numerous entries and exits in the previous figure indicate a major drawback of trace scheduling
  – First, it requires that the compiler build mechanisms for recovering from mispredictions into the unrolled code
    • for instance, imagine that we unroll a loop 4 times; the compiler then has to build into the code what to do if a misprediction occurs in the first iteration and how to re-enter, what to do if it occurs in the second iteration and how to re-enter, etc
  – Second, it increases the amount of code required and complicates the code
• The superblock uses the same idea except that, upon exiting, you enter a different block which foregoes speculation
  – When the loop terminates, the superblock is re-entered
  – This is done using a technique called tail duplication

Page 21: Software Exploits for ILP

Example Superblock

Page 22: Software Exploits for ILP

Example
• Assume our code is:
  – for(i=0;i<n;i++) if(a[i]>0) x++; else x--;
  – In most cases, a[i] is positive
• We choose to move x++ out of the selection statement and replace the selection with if(a[i]<=0) x=x-2;
  – That is, we add 1 to x automatically, and if we mis-speculate, we subtract 2 from x
• We then unroll the loop four times, giving us the following (in C rather than assembly, assuming n is a multiple of 4)
  – for(i=0;i<n;i+=4) {
      x+=4;
      if(a[i]<=0)   {…}   // code in the { } requires
      if(a[i+1]<=0) {…}   // subtracting from x, and then
      if(a[i+2]<=0) {…}   // completing the remaining
      if(a[i+3]<=0) {…}   // loop iterations using the
    }                     // original code

Page 23: Software Exploits for ILP

Predicated Instructions
• We have seen that every loop and every selection statement involves branches
  – Which can result in branch delays, or in speculation that leads to stalls when mispredicted
• If the condition and action are simple enough, can we do them without a branch?
  – The answer is yes, if we use predicated (or conditional) instructions
• The idea is that the condition and the action are both performed, but if the condition is determined to be false, the register write is canceled
• In most cases, predicated instructions can
  – only use a simple condition: value = 0 or value != 0
  – only have a single, simple action such as x = y
• Here, we consider two:
  – MOVZ – conditional move
  – LWC – conditional load

Page 24: Software Exploits for ILP

Examples
• The code if(A==0) {S=T;} can be implemented in MIPS (with A in R1, S in R2, T in R3) as
  – BNEZ R1, L
  – ADD R2, R3, R0
  – L: …
• Or with the MOVZ instruction as
  – MOVZ R2, R3, R1
  – Move R3 to R2 if R1 = 0; if R1 != 0, cancel the move before it is finalized
    • assuming the MIPS pipeline, we reduce from 2 instructions to 1 and remove the branch penalty, a potential savings of 2 clock cycles
• In MIPS, MOVZ (with its counterpart MOVN) is essentially the only predicated instruction, but other architectures offer more, such as LWC
  – Load if condition is true
  – LWC R2, 0(R3), R1 – load 0(R3) into R2 if R1 = 0
  – The instruction computes 0+R3 and performs the R1 = 0 test in the EX stage, and either loads 0+R3 into R2 in MEM and WB respectively or does nothing in MEM/WB, depending on the result of the condition

Page 25: Software Exploits for ILP

Handling Exceptions
• Whether we use predicated instructions or compiler scheduling (e.g., trace scheduling)
  – an instruction that raises an exception, but that should not have executed because of mis-speculation, should not cause the exception
    • recall exceptions may invoke an exception handler, which is very time consuming, or may cause program termination
    • we need a way to recover from a mis-speculated exception situation
  – in the former case, we can invoke the exception handler and cancel it later if we determine the instruction was mis-speculated – this wastes some time but preserves proper behavior
  – for the latter case, we need a mechanism to ensure that the exception is either not raised or ignored until we know whether the speculation was correct or not

Page 26: Software Exploits for ILP

Four Approaches
• Hardware and OS work cooperatively to ignore exceptions of speculative instructions
  – this only preserves correct behavior for correct programs
• Speculative instructions are not permitted to raise exceptions
  – speculative instructions are annotated as such
    • for instance, a speculative load might be sLW
  – we disallow the instruction from raising an exception
• Poison bits are attached to registers to indicate that their value was the result of speculation
  – we add a bit to every register; a speculated instruction that writes to a register sets the bit, and a register written using a source register whose poison bit is set becomes poisoned as well
  – exceptions are disallowed for any instruction that uses a poisoned register until the speculation is resolved
• Buffers are used to store results of speculated instructions
  – like a reorder buffer, we only permit results to move beyond the buffer once the speculation is known; until then, exceptions are buffered

Page 27: Software Exploits for ILP

Example
• Imagine we have the following statement:
  – if (A==0) A = B; else A = A + 4;
• The original code is shown first; the compiler uses speculation to generate the better code after it
  – If the speculation is correct 90% of the time, the code goes from .90 * 5 + .10 * 4 = 4.9 cycles to .90 * 4 + .10 * 5 = 4.1 cycles (a speedup of about 19%)

  Original:
        LW     R1, 0(R3)
        BNEZ   R1, L1
        LW     R1, 0(R2)
        J      L2
  L1:   DADDI  R1, R1, #4
  L2:   SW     R1, 0(R3)

  Speculated:
        LW     R1, 0(R3)
        LW     R14, 0(R2)     ; speculative load of B
        BEQZ   R1, L3
        DADDI  R14, R1, #4
  L3:   SW     R14, 0(R3)

• The speculated code adds register R14 so that the value in R1 is not destroyed if we have a mis-speculation. Additionally, we do not want the SW to take place until the speculation is known

Page 28: Software Exploits for ILP

Continued
• While the previous example ensured the proper value was stored to A, it did nothing to prevent an exception from a mis-speculation
  – Specifically, we do not want to load B (0(R2)) if A is not 0
    • imagine that A is not 0 and 0(R2) causes a memory violation; this would cause an exception that should never arise
  – We indicate that the load for B is speculative (sLD) and add a new instruction called SPECCK – speculative check – which ensures that an exception only arises if the speculated instruction should have executed AND it caused an exception

        LD     R1, 0(R3)
        sLD    R14, 0(R2)    ; speculative load of B
        BNEZ   R1, L1
        SPECCK 0(R2)         ; check the deferred exception
        J      L2
  L1:   DADDI  R14, R1, #4
  L2:   SD     R14, 0(R3)

Page 29: Software Exploits for ILP

Limitations on Speculation
• Instructions that are annulled (turned into no-ops) still take execution time
• Conditional instructions are most useful when the condition can be evaluated early, such as during the ID stage of our pipeline
• Speculated instructions may cause a slowdown compared to unconditional instructions, requiring either a slower clock rate or a greater number of cycles
• The use of conditional instructions can be limited when the control flow involves more than a simple alternative sequence
  – for example, moving an instruction across multiple branches requires making it conditional on both branches, which requires two conditions to be specified or requires additional instructions to compute the controlling predicate
  – if such capabilities are not present, the overhead of if-conversion will be larger, reducing its advantage

Page 30: Software Exploits for ILP

Intel IA-64/EPIC
• This chapter introduced a number of compiler-based strategies to promote ILP to support a superscalar processor
  – To date, very few processors have attempted to aggressively schedule parallel instructions through the compiler, instead relying on hardware scheduling
  – The IA-64 is one of the few; here we look at a few highlights of the instruction set and see how instructions are bundled together to issue in a VLIW-like way
    • 128 65-bit general registers (64 data bits plus 1 poison bit)
    • 128 82-bit FP registers
    • 64 1-bit predicate registers
    • 8 64-bit branch registers (for indirect branching)
    • a register stack for parameter passing (rather than memory)

Page 31: Software Exploits for ILP

Instruction Format
• The compiler uses a number of strategies to provide ILP
  – Loop unrolling
  – Speculation
  – Scheduling, etc
• The compiler selects up to 3 consecutive instructions to place into a "bundle"
  – A bundle is 128 bits wide, consisting of
    • a 5-bit template field
    • up to three instructions of 41 bits each (or no-ops as necessary)
  – the 5-bit template describes the type of each instruction; each type has its own formatting, so some of the instruction information is encoded in the template
  – the template also indicates whether a stop should exist – stops mark boundaries where dependences prevent parallel execution, so stalls may be needed

Page 32: Software Exploits for ILP

Bundle Components
• All instructions break into one of 5 types:
  – I: integer ALU, non-ALU integer
  – M: memory (int & FP), integer ALU
  – F: floating point
  – B: branches and conditional instructions
  – L+X: instructions with extended immediate data, stops, and no-ops

  Template #   Slot 0   Slot 1   Slot 2
  0            M        I        I
  1            M        I        I
  2            M        I        I
  3            M        I        I
  …
  8            M        M        I
  9            M        M        I
  …
  12           M        F        I
  13           M        F        I
  14           M        M        F
  15           M        M        F
  …
  29           M        F        B

See figure H.7 for the full table

Page 33: Software Exploits for ILP

Example
• Unroll the x[i]=x[i]+s; loop seven times and schedule the instructions in IA-64 bundles
  – First to minimize bundles
  – Second to minimize cycles

  Loop: L.D   F0, 0(R1)
        L.D   F6, -8(R1)
        L.D   F10, -16(R1)
        L.D   F14, -24(R1)
        L.D   F18, -32(R1)
        L.D   F22, -40(R1)
        L.D   F26, -48(R1)
        ADD.D F4, F0, F2
        ADD.D F8, F6, F2
        ADD.D F12, F10, F2
        ADD.D F16, F14, F2
        ADD.D F20, F18, F2
        ADD.D F24, F22, F2
        ADD.D F28, F26, F2
        S.D   F4, 0(R1)
        S.D   F8, -8(R1)
        S.D   F12, -16(R1)
        S.D   F16, -24(R1)
        S.D   F20, -32(R1)
        S.D   F24, -40(R1)
        S.D   F28, -48(R1)
        DADDI R1, R1, #-56
        BNE   R1, R2, Loop

Page 34: Software Exploits for ILP

Minimizing bundles:

  Bundle    Slot 0            Slot 1             Slot 2             Execute cycle
  9: MMI    L.D F0, 0(R1)     L.D F6, -8(R1)                         1
  14: MMF   L.D F10, -16(R1)  L.D F14, -24(R1)   ADD.D F4, F0, F2    3
  15: MMF   L.D F18, -32(R1)  L.D F22, -40(R1)   ADD.D F8, F6, F2    4
  15: MMF   L.D F26, -48(R1)  S.D F4, 0(R1)      ADD.D F12, F10, F2  6
  15: MMF   S.D F8, -8(R1)    S.D F12, -16(R1)   ADD.D F16, F14, F2  9
  15: MMF   S.D F16, -24(R1)                     ADD.D F20, F18, F2  12
  15: MMF   S.D F20, -32(R1)                     ADD.D F24, F22, F2  15
  15: MMF   S.D F24, -40(R1)                     ADD.D F28, F26, F2  18
  16: MIB   S.D F28, -48(R1)  DADDUI R1,R1,#-56  BNE R1, R2, Loop    21

Minimizing cycles:

  Bundle    Slot 0            Slot 1             Slot 2             Execute cycle
  8: MMI    L.D F0, 0(R1)     L.D F6, -8(R1)                         1
  9: MMI    L.D F10, -16(R1)  L.D F14, -24(R1)                       2
  14: MMF   L.D F18, -32(R1)  L.D F22, -40(R1)   ADD.D F4, F0, F2    3
  14: MMF   L.D F26, -48(R1)                     ADD.D F8, F6, F2    4
  15: MMF                                        ADD.D F12, F10, F2  5
  14: MMF   S.D F4, 0(R1)                        ADD.D F16, F14, F2  6
  14: MMF   S.D F8, -8(R1)                       ADD.D F20, F18, F2  7
  15: MMF   S.D F12, -16(R1)                     ADD.D F24, F22, F2  8
  14: MMF   S.D F16, -24(R1)                     ADD.D F28, F26, F2  9
  9: MMI    S.D F20, -32(R1)  S.D F24, -40(R1)                       11
  16: MIB   S.D F28, -48(R1)  DADDUI R1,R1,#-56  BNE R1, R2, Loop    12

Page 35: Software Exploits for ILP

Speculation Support
• Nearly every instruction can be predicated
  – This is done by specifying one of the predicate registers
    • a conditional branch can become an unconditional branch with a predicate register
  – Predicate registers are set using a compare or test instruction
• Hardware supports predication by controlling when exceptions are handled
  – for a predicated instruction, an exception can only be handled once the predicate's result is known, and deferred exceptions are tracked using registers with poison bits
  – the compiler is tasked with generating recovery code for exceptions that arise because of mis-speculation
• Speculated loads use a special table so that if the load is mis-speculated, it does not wipe out a current register value

Page 36: Software Exploits for ILP

Itanium 2 Performance
• The IA-64/EPIC instruction set was implemented in the Itanium 2 processor with a 1.5 GHz clock
  – Its performance on int and FP benchmarks was compared to the Pentium IV (3.8 GHz), AMD Athlon and Power5
    • the Itanium 2 compares favorably to the Pentium IV & Athlon on FP benchmarks but is slower on int benchmarks