Slide 1
Finding the Limits of Hardware Optimization through Software De-optimization
Presented by: Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed
Slide 2
Introduction
Project Structure
Judging de-optimizations
What does a de-op look like?
General Areas of Focus:
  Instruction Fetching and Decoding
  Instruction Scheduling
  Instruction Type Usage (e.g. Integer vs. FP)
  Branch Prediction
Conclusion
Slide 3
De-optimization? That's crazy! Why???
In the world of hardware development, when optimizations are compared, the comparisons often concern just how fast a piece of hardware can run an algorithm.
Yet, in the world of software development, the hardware is often a distant afterthought.
Given this dichotomy, how relevant are these standard analyses and comparisons?
Slide 4
So, why not find out how bad it can get?
By de-optimizing software, we can see how bad algorithmic performance can be if the hardware isn't considered.
At a minimum, we want to be able to answer two questions:
How good a compiler writer must someone be?
How good a programmer must someone be?
Slide 5
For our research project:
We have been studying instruction fetching, decoding, scheduling, and branch optimization.
We have been using knowledge of optimizations to design and predict de-optimizations.
We have been studying the Opteron in detail.
Slide 6
For our implementation project:
We will choose de-optimizations to implement.
We will choose algorithms that may best reflect our de-optimizations.
We will implement the de-optimizations.
We will report the results.
Slide 7
We need to decide on an overall metric for comparison.
Whether a de-op affects scheduling, caching, branching, etc., its impact will be felt in the clocks needed to execute an algorithm.
So, our metric of choice will be CPU clock cycles.
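On x86, this metric can be read directly from the time-stamp counter. The sketch below is our own illustration, not project code: `workload` is a hypothetical stand-in for an algorithm under test, and the TSC counts fixed-rate ticks that serve as a proxy for clock cycles on modern x86 parts.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang, x86 only */

static volatile uint64_t sink;  /* keeps the workload from being optimized away */

/* Hypothetical workload standing in for an algorithm under test. */
static void workload(void)
{
    uint64_t sum = 0;
    for (uint64_t i = 0; i < 1000000; i++)
        sum += i;
    sink = sum;
}

/* Returns the time-stamp counter ticks the workload consumed. */
uint64_t measure_cycles(void)
{
    uint64_t start = __rdtsc();
    workload();
    return __rdtsc() - start;
}
```

In practice one would pin the thread to a core and repeat the measurement many times, since interrupts and frequency scaling add noise to a single reading.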
Slide 8
With our metric, we can compare de-ops, but should we?
Inevitably, we will ask which de-ops had the greater impact, i.e. caused the greatest jump in clocks. So, yes, we should.
But this has to be done very carefully, since an intended de-op may not be the actual or full cause of a bump in clocks. It could be a side effect caused by the new code combination.
Of course, this would still be some kind of de-op, just not the intended de-op.
Slide 9
Definition: A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question.
Is an infinite loop a de-op? -- NO. Why not? It tells us nothing about the hardware.
Is a loop that executes more cycles than necessary a de-op? -- NO. Again, it tells us nothing about the CPU.
Is a combination of instructions that causes increased branch mispredictions a de-op? -- YES.
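The third case can be made concrete in C. In the sketch below, our own illustration rather than project code (the threshold of 128 and the data are arbitrary choices), a data-dependent branch over random input defeats the predictor, while the same code over sorted input predicts well:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum only the elements >= 128. On random data the branch outcome is
   effectively a coin flip, so the predictor mispredicts often; on
   sorted data the outcomes form two long runs and predict well. The
   returned value is identical either way -- only the clock cycles
   consumed differ. */
uint64_t sum_large(const int *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)      /* hard to predict on random data */
            sum += (uint64_t)data[i];
    }
    return sum;
}
```

Sorting the input first leaves the returned sum unchanged but typically cuts mispredictions sharply, so the cycle count drops even though the instruction count is essentially the same.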
Slide 10
Given some CPU, what aspects can we optimize code for? These aspects will be our focus for de-optimization.
In general, when optimizing software, the following are the areas to focus on:
Instruction Fetching and Decoding
Instruction Scheduling
Instruction Type Usage (e.g. Integer vs. FP)
Branch Prediction
These will be our areas for de-optimization.
Slide 11
In class, when we discussed dynamic scheduling, for example, our team was not sanguine about being able to truly de-optimize code. In fact, we even imagined that our result might be that CPUs are now generally so good that true de-optimization is very difficult to achieve. In principle, we still believe this.
In retrospect, we should have been wiser. Just like Plato's Forms, there is a significant, if not absolute, difference between something imagined in the abstract and its worldly representation. There can be no perfect circles in the real world.
Thus, in practice, as Gita has stressed, CPU designers make choices in their designs that are driven by cost, energy consumption, aesthetics, etc.
Slide 12
These choices, when it comes time to write software for a CPU, become idiosyncrasies that must be accounted for when optimizing.
For those writing optimal code, they are hassles that one must pay attention to.
For our project team, these idiosyncrasies are potential "gold mines" for de-optimization.
In fact, the AMD Opteron (K10 architecture) exhibits a number of idiosyncrasies. You will see some of these today.
Slide 13
AMD Opteron (K10)
The dynamic scheduling pick window is 32 bytes in length, while instructions can be 1-16 bytes in length. So, scheduling can be adversely affected by instruction length.
The branch target buffer (BTB) can only maintain 3 branch history entries per 16 bytes.
Branch indicators are aligned at odd-numbered positions within 16-byte code blocks. So, 1-byte branches, like return instructions, will be mispredicted if misaligned.
Slide 14
Intel i7 (Nehalem)
The number of read ports for the register file is too small. This can result in stalls when reading registers.
Instruction fetch/decode bandwidth is limited to 16 bytes per cycle. Instruction density can overwhelm the predecoder, which can only manage 6 instructions (per 16 bytes) per cycle.
Slide 15
In the upcoming discussion of de-optimization techniques, we will present...
...the area of the CPU that each de-op derives from
...some, hopefully, illuminating title
...a general characterization of the de-op. This characterization may apply to many different CPU architectures. Generally, each of these represents a choice that may be made by a hardware designer
...a specific characterization of the de-op on the AMD Opteron. This characterization will apply only to the Opterons on Hydra
Slide 16
So, without further ado...
Slide 17
Instruction Fetching and Decoding
Decoding Bandwidth
Execution Latency
Slide 18
De-optimization #1 - Decrease Decoding Bandwidth [AMD05]
Scenario #1
Many CISC architectures offer combined load-and-execute instructions as well as the typical discrete versions. Often, using the discrete versions can decrease the instruction decoding bandwidth.
Example:
add rax, QWORD PTR [foo]
Slide 19
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #1 - The Opteron
The Opteron can decode 3 combined load-execute (LE) instructions per cycle. Using discrete LE instructions will allow us to decrease the decode rate.
Example:
mov rbx, QWORD PTR [foo]
add rax, rbx
Slide 20
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
Scenario #2
Use instructions with longer encodings rather than those with shorter encodings to decrease the average decode rate by decreasing the number of instructions that can fit into the L1 instruction cache. This also effectively shrinks the scheduling pick window.
For example, use 32-bit displacements instead of 8-bit displacements, and the 2-byte opcode form instead of the 1-byte opcode form of simple integer instructions.
Slide 21
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #2 - The Opteron
The Opteron has short and long variants of a number of its instructions, like indirect add, for example. We can use the long variants of these instructions in order to drive down the decode rate. This will also have the effect of shrinking the Opteron's 32-byte pick window for instruction scheduling.
Examples of long variants:
81 C0 78 56 34 12    add eax, 12345678h   ; long opcode form
81 C3 FB FF FF FF    add ebx, -5          ; 32-bit immediate value
0F 84 05 00 00 00    jz  label1           ; 2-byte opcode, 32-bit displacement
Slide 22
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
A balancing act
The scenarios for this de-optimization have flip sides that could make them difficult to implement.
For example, scenario #1 describes using discrete load-execute instructions in order to decrease the average decode rate. However, sometimes discrete load-execute instructions are called for:
Discrete load-execute instructions can provide the scheduler with more flexibility when scheduling.
In addition, on the Opteron, they consume less of the 32-byte pick window, thereby giving the scheduler more options.
Slide 23
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
When could this happen?
This de-optimization could occur naturally when:
A compiler does a very poor job
The memory model forces long-version encodings of instructions, e.g. 32-bit displacements
Our prediction for implementation
We predict mixed results when trying to implement this de-optimization.
Slide 24
De-optimization #2 - Increase Execution Latency [AMD05]
Scenario
CPUs often have instructions that can perform almost the same operation. Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized.
Slide 25
De-optimization #2 - Increase Execution Latency (cont'd)
In Practice - The Opteron
We can use the 16-bit LEA instruction, which is a VectorPath instruction, to reduce the decode bandwidth and increase execution latency.
The LOOP instruction on the Opteron has a latency of 8 cycles, while a test (like DEC) and jump (like JNZ) have a combined latency of less than 4 cycles. Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization.
Slide 26
De-optimization #2 - Increase Execution Latency (cont'd)
When could this happen?
This de-optimization could occur if the user simply does the following:
float a, b;
b = a / 100.0;
instead of:
float a, b;
b = a * 0.01;
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
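The substitution can be sketched as a pair of functions (our own illustration, not project code). The two forms agree only to within rounding error, since 0.01 is not exactly representable in binary floating point, yet the division form carries a much higher latency:

```c
/* High-latency version: floating-point division. */
float scale_div(float a) { return a / 100.0f; }

/* Low-latency version: multiplication by the (rounded) reciprocal.
   The results can differ from the division form in the last bits. */
float scale_mul(float a) { return a * 0.01f; }
```

A compiler invoked with strict floating-point semantics must keep the division the programmer wrote, which is exactly what makes this de-op easy to trigger from source code.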
De-optimization #1 - Address-Generation Interlocks [AMD05]
Scenario
Scheduling loads and stores whose addresses require a long dependency chain to generate ahead of loads and stores whose addresses can be calculated quickly can create address-generation interlocks.
Example:
add ebx, ecx                   ; Instruction 1
mov eax, DWORD PTR [10h]       ; Instruction 2
mov edx, DWORD PTR [24h]       ; Place load above instruction 3 to avoid AGI stall
mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3
Slide 29
De-optimization #1 - Address-Generation Interlocks (cont'd)
In Practice - The Opteron
The processor schedules instructions that access the data cache (loads and stores) in program order. By randomly choosing the order of loads and stores, we can seek address-generation interlocks.
Example:
add ebx, ecx                   ; Instruction 1
mov eax, DWORD PTR [10h]       ; Instruction 2 (fast address calc.)
mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3 (slow address calc.)
mov edx, DWORD PTR [24h]       ; This load is stalled from accessing the data
                               ; cache due to the long latency caused by
                               ; generating the address for instruction 3
Slide 30
De-optimization #1 - Address-Generation Interlocks (cont'd)
When could this happen?
This happens when a load or store whose address requires a long dependency chain is scheduled ahead of one whose address can be calculated quickly.
Our prediction for implementation
We expect an increase in the number of clock cycles from using this de-optimization technique.
Slide 31
De-optimization #2 - Increase Register Pressure [AMD05]
Scenario
Avoid pushing memory data directly onto the stack; instead, load it into a register to increase register pressure and create data dependencies.
In Practice - The Opteron
Emit code that first loads the memory data into a register and then pushes it onto the stack, increasing register pressure and creating data dependencies.
Example, instead of:
push mem
use:
mov rax, mem
push rax
Slide 32
De-optimization #2 - Increase Register Pressure (cont'd)
When could this happen?
This could take place when, instead of pushing memory data directly onto the stack, we first load it into a register and then push the register.
Our prediction for implementation
We expect performance to be affected by the increased register pressure.
Slide 33
De-optimization #3 - Loop Re-rolling
Scenario
Loops not only affect branch prediction; they can also affect dynamic scheduling. How? Let instructions 1 and 2 be within loops A and B, respectively. 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be.
In Practice - The Opteron
Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC.
Slide 34
De-optimization #3 - Loop Re-rolling (cont'd)
When could this happen?
Easily. In C, this would be two consecutive loops, each containing one or more instructions, such that the loops could be combined.
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
Example:
--- Version 1 ---
for( i = 0; i < n; i++ ) {
    quadratic_array[i] = i * i;
    cubic_array[i] = i * i * i;
}
--- Version 2 ---
for( i = 0; i < n; i++ ) {
    quadratic_array[i] = i * i;
}
for( i = 0; i < n; i++ ) {
    cubic_array[i] = i * i * i;
}
Slide 35
Instruction Type Usage
Store-to-load dependency
Costly Instruction
Slide 36
De-optimization #1 - Store-to-Load Dependency
Scenario
A store-to-load dependency takes place when stored data needs to be used again shortly afterward. This is common. This type of dependency increases the pressure on the load/store unit and might cause the CPU to stall, especially when it occurs frequently.
Example:
for (k=1;k
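The slide's loop is cut off above; a minimal prefix-sum sketch (our own illustration of the pattern, not the original example) shows a store-to-load dependency:

```c
/* Each iteration loads a[k-1], which was stored by the previous
   iteration: a classic store-to-load dependency. The load must wait
   for the store buffer to forward the just-written value, which
   serializes the loop. */
void prefix_sum(int *a, int n)
{
    for (int k = 1; k < n; k++)
        a[k] = a[k] + a[k - 1];  /* load of a[k-1] depends on the prior store */
}
```

The recurrence cannot be reordered by the scheduler, so every iteration pays the store-to-load forwarding latency regardless of how wide the machine is.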