Slide 1
Finding the Limits of Hardware Optimization through Software De-optimization
Presented by: Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed
Slide 2
Introduction
Project Structure
Judging de-optimizations
What does a de-op look like?
General Areas of Focus:
  Instruction Fetching and Decoding
  Instruction Scheduling
  Instruction Type Usage (e.g. Integer vs. FP)
  Branch Prediction
Conclusion
Slide 3
De-optimization? That's crazy! Why???
In the world of hardware development, when optimizations are compared, the comparisons often concern just how fast a piece of hardware can run an algorithm.
Yet, in the world of software development, the hardware is often a distant afterthought.
Given this dichotomy, how relevant are these standard analyses and comparisons?
Slide 4
So, why not find out how bad it can get?
By de-optimizing software, we can see how bad algorithmic performance can be if the hardware isn't considered.
At a minimum, we want to be able to answer two questions:
How good a compiler writer must someone be?
How good a programmer must someone be?
Slide 5
For our research project:
We have been studying instruction fetching, decoding, scheduling, and branch optimization.
We have been using knowledge of optimizations to design and predict de-optimizations.
We have been studying the Opteron in detail.
Slide 6
For our implementation project:
We will choose de-optimizations to implement.
We will choose algorithms that may best reflect our de-optimizations.
We will implement the de-optimizations.
We will report the results.
Slide 7
We need to decide on an overall metric for comparison.
Whether a de-op affects scheduling, caching, branching, etc., its impact will be felt in the clocks needed to execute an algorithm.
So, our metric of choice will be CPU clock cycles.
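On x86, this metric can be read directly from the time-stamp counter. The sketch below is our own illustration, not project code: `workload` is a hypothetical stand-in for an algorithm under test, and the TSC counts fixed-rate ticks that serve as a proxy for clock cycles on modern x86 parts.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang, x86 only */

static volatile uint64_t sink;  /* keeps the workload from being optimized away */

/* Hypothetical workload standing in for an algorithm under test. */
static void workload(void)
{
    uint64_t sum = 0;
    for (uint64_t i = 0; i < 1000000; i++)
        sum += i;
    sink = sum;
}

/* Returns the time-stamp counter ticks the workload consumed. */
uint64_t measure_cycles(void)
{
    uint64_t start = __rdtsc();
    workload();
    return __rdtsc() - start;
}
```

In practice one would pin the thread to a core and repeat the measurement many times, since interrupts and frequency scaling add noise to a single reading.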
Slide 8
With our metric, we can compare de-ops, but should we?
Inevitably, we will ask which de-ops had the greater impact, i.e. caused the greatest jump in clocks. So, yes, we should.
But this has to be done very carefully, since an intended de-op may not be the actual or full cause of a bump in clocks. It could be a side effect caused by the new code combination.
Of course, this would still be some kind of de-op, just not the intended de-op.
Slide 9
Definition: A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question.
Is an infinite loop a de-op? -- NO. Why not? It tells us nothing about the hardware.
Is a loop that executes more cycles than necessary a de-op? -- NO. Again, it tells us nothing about the CPU.
Is a combination of instructions that causes increased branch mispredictions a de-op? -- YES.
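The third case can be made concrete in C. In the sketch below, our own illustration rather than project code (the threshold of 128 and the data are arbitrary choices), a data-dependent branch over random input defeats the predictor, while the same code over sorted input predicts well:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum only the elements >= 128. On random data the branch outcome is
   effectively a coin flip, so the predictor mispredicts often; on
   sorted data the outcomes form two long runs and predict well. The
   returned value is identical either way -- only the clock cycles
   consumed differ. */
uint64_t sum_large(const int *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)      /* hard to predict on random data */
            sum += (uint64_t)data[i];
    }
    return sum;
}
```

Sorting the input first leaves the returned sum unchanged but typically cuts mispredictions sharply, so the cycle count drops even though the instruction count is essentially the same.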
Slide 10
Given some CPU, what aspects can we optimize code for? These aspects will be our focus for de-optimization.
In general, when optimizing software, the following are the areas to focus on:
Instruction Fetching and Decoding
Instruction Scheduling
Instruction Type Usage (e.g. Integer vs. FP)
Branch Prediction
These will be our areas for de-optimization.
Slide 11
In class, when we discussed dynamic scheduling, for example, our team was not sanguine about being able to truly de-optimize code. In fact, we even imagined that our result might be that CPUs are now generally so good that true de-optimization is very difficult to achieve. In principle, we still believe this.
In retrospect, we should have been wiser. Just like Plato's Forms, there is a significant, if not absolute, difference between something imagined in the abstract and its worldly representation. There can be no perfect circles in the real world.
Thus, in practice, as Gita has stressed, CPU designers make choices in their designs that are driven by cost, energy consumption, aesthetics, etc.
Slide 12
These choices, when it comes time to write software for a CPU, become idiosyncrasies that must be accounted for when optimizing.
For those writing optimal code, they are hassles that one must pay attention to.
For our project team, these idiosyncrasies are potential "gold mines" for de-optimization.
In fact, the AMD Opteron (K10 architecture) exhibits a number of idiosyncrasies. You will see some of these today.
Slide 13
AMD Opteron (K10)
The dynamic scheduling pick window is 32 bytes in length, while instructions can be 1-16 bytes in length. So, scheduling can be adversely affected by instruction length.
The branch target buffer (BTB) can only maintain 3 branch history entries per 16 bytes.
Branch indicators are aligned at odd-numbered positions within 16-byte code blocks. So, 1-byte branches, like return instructions, will be mispredicted if misaligned.
Slide 14
Intel i7 (Nehalem)
The number of read ports for the register file is too small. This can result in stalls when reading registers.
Instruction fetch/decode bandwidth is limited to 16 bytes per cycle. Instruction density can overwhelm the predecoder, which can only manage 6 instructions (per 16 bytes) per cycle.
Slide 15
In the upcoming discussion of de-optimization techniques, we will present...
...the area of the CPU that each de-op derives from
...some, hopefully, illuminating title
...a general characterization of the de-op. This characterization may apply to many different CPU architectures. Generally, each of these represents a choice that may be made by a hardware designer
...a specific characterization of the de-op on the AMD Opteron. This characterization will apply only to the Opterons on Hydra
Slide 16
So, without further ado...
Slide 17
Instruction Fetching and Decoding
Decoding Bandwidth
Execution Latency
Slide 18
De-optimization #1 - Decrease Decoding Bandwidth [AMD05]
Scenario #1
Many CISC architectures offer combined load-and-execute instructions as well as the typical discrete versions. Often, using the discrete versions can decrease the instruction decoding bandwidth.
Example:
add rax, QWORD PTR [foo]
Slide 19
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #1 - The Opteron
The Opteron can decode 3 combined load-execute (LE) instructions per cycle. Using discrete LE instructions will allow us to decrease the decode rate.
Example:
mov rbx, QWORD PTR [foo]
add rax, rbx
Slide 20
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
Scenario #2
Use instructions with longer encodings rather than those with shorter encodings to decrease the average decode rate by decreasing the number of instructions that can fit into the L1 instruction cache. This also effectively shrinks the scheduling pick window.
For example, use 32-bit displacements instead of 8-bit displacements, and the 2-byte opcode form instead of the 1-byte opcode form of simple integer instructions.
Slide 21
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #2 - The Opteron
The Opteron has short and long variants of a number of its instructions, like indirect add, for example. We can use the long variants of these instructions in order to drive down the decode rate. This will also have the effect of shrinking the Opteron's 32-byte pick window for instruction scheduling.
Examples of long variants:
81 C0 78 56 34 12    add eax, 12345678h   ; long opcode form
81 C3 FB FF FF FF    add ebx, -5          ; 32-bit immediate value
0F 84 05 00 00 00    jz  label1           ; 2-byte opcode, 32-bit displacement
Slide 22
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
A balancing act
The scenarios for this de-optimization have flip sides that could make them difficult to implement.
For example, scenario #1 describes using discrete load-execute instructions in order to decrease the average decode rate. However, sometimes discrete load-execute instructions are called for:
Discrete load-execute instructions can provide the scheduler with more flexibility when scheduling.
In addition, on the Opteron, they consume less of the 32-byte pick window, thereby giving the scheduler more options.
Slide 23
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
When could this happen?
This de-optimization could occur naturally when:
A compiler does a very poor job
The memory model forces long-version encodings of instructions, e.g. 32-bit displacements
Our prediction for implementation
We predict mixed results when trying to implement this de-optimization.
Slide 24
De-optimization #2 - Increase Execution Latency [AMD05]
Scenario
CPUs often have instructions that can perform almost the same operation. Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized.
Slide 25
De-optimization #2 - Increase Execution Latency (cont'd)
In Practice - The Opteron
We can use the 16-bit LEA instruction, which is a VectorPath instruction, to reduce the decode bandwidth and increase execution latency.
The LOOP instruction on the Opteron has a latency of 8 cycles, while a test (like DEC) and jump (like JNZ) have a combined latency of less than 4 cycles. Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization.
Slide 26
De-optimization #2 - Increase Execution Latency (cont'd)
When could this happen?
This de-optimization could occur if the user simply does the following:
float a, b;
b = a / 100.0;
instead of:
float a, b;
b = a * 0.01;
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
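The substitution can be sketched as a pair of functions (our own illustration, not project code). The two forms agree only to within rounding error, since 0.01 is not exactly representable in binary floating point, yet the division form carries a much higher latency:

```c
/* High-latency version: floating-point division. */
float scale_div(float a) { return a / 100.0f; }

/* Low-latency version: multiplication by the (rounded) reciprocal.
   The results can differ from the division form in the last bits. */
float scale_mul(float a) { return a * 0.01f; }
```

A compiler invoked with strict floating-point semantics must keep the division the programmer wrote, which is exactly what makes this de-op easy to trigger from source code.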
De-optimization #1 - Address-Generation Interlocks [AMD05]
Scenario
Scheduling loads and stores whose addresses require a long dependency chain to generate ahead of loads and stores whose addresses can be calculated quickly can create address-generation interlocks.
Example:
add ebx, ecx                   ; Instruction 1
mov eax, DWORD PTR [10h]       ; Instruction 2
mov edx, DWORD PTR [24h]       ; Place load above instruction 3 to avoid AGI stall
mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3
Slide 29
De-optimization #1 - Address-Generation Interlocks (cont'd)
In Practice - The Opteron
The processor schedules instructions that access the data cache (loads and stores) in program order. By randomly choosing the order of loads and stores, we can seek address-generation interlocks.
Example:
add ebx, ecx                   ; Instruction 1
mov eax, DWORD PTR [10h]       ; Instruction 2 (fast address calc.)
mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3 (slow address calc.)
mov edx, DWORD PTR [24h]       ; This load is stalled from accessing the data
                               ; cache due to the long latency caused by
                               ; generating the address for instruction 3
Slide 30
De-optimization #1 - Address-Generation Interlocks (cont'd)
When could this happen?
This happens when a load or store whose address requires a long dependency chain is scheduled ahead of one whose address can be calculated quickly.
Our prediction for implementation
We expect an increase in the number of clock cycles from using this de-optimization technique.
Slide 31
De-optimization #2 - Increase Register Pressure [AMD05]
Scenario
Avoid pushing memory data directly onto the stack; instead, load it into a register to increase register pressure and create data dependencies.
In Practice - The Opteron
Emit code that first loads the memory data into a register and then pushes it onto the stack, increasing register pressure and creating data dependencies.
Example, instead of:
push mem
use:
mov rax, mem
push rax
Slide 32
De-optimization #2 - Increase Register Pressure (cont'd)
When could this happen?
This could take place when, instead of pushing memory data directly onto the stack, we first load it into a register and then push the register.
Our prediction for implementation
We expect performance to be affected by the increased register pressure.
Slide 33
De-optimization #3 - Loop Re-rolling
Scenario
Loops not only affect branch prediction; they can also affect dynamic scheduling. How? Let instructions 1 and 2 be within loops A and B, respectively. 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be.
In Practice - The Opteron
Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC.
Slide 34
De-optimization #3 - Loop Re-rolling (cont'd)
When could this happen?
Easily. In C, this would be two consecutive loops, each containing one or more instructions, such that the loops could be combined.
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
Example:
--- Version 1 ---
for( i = 0; i < n; i++ ) {
    quadratic_array[i] = i * i;
    cubic_array[i] = i * i * i;
}
--- Version 2 ---
for( i = 0; i < n; i++ ) {
    quadratic_array[i] = i * i;
}
for( i = 0; i < n; i++ ) {
    cubic_array[i] = i * i * i;
}
Slide 35
Instruction Type Usage
Store-to-load dependency
Costly Instruction
Slide 36
De-optimization #1 - Store-to-Load Dependency
Scenario
A store-to-load dependency takes place when stored data needs to be used again shortly afterward. This is common. This type of dependency increases the pressure on the load/store unit and might cause the CPU to stall, especially when it occurs frequently.
Example:
for (k=1;k
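The slide's loop is cut off above; a minimal prefix-sum sketch (our own illustration of the pattern, not the original example) shows a store-to-load dependency:

```c
/* Each iteration loads a[k-1], which was stored by the previous
   iteration: a classic store-to-load dependency. The load must wait
   for the store buffer to forward the just-written value, which
   serializes the loop. */
void prefix_sum(int *a, int n)
{
    for (int k = 1; k < n; k++)
        a[k] = a[k] + a[k - 1];  /* load of a[k-1] depends on the prior store */
}
```

The recurrence cannot be reordered by the scheduler, so every iteration pays the store-to-load forwarding latency regardless of how wide the machine is.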