8/2/2019 Day01 Hpc Wrkshp Compiler Opt
1/61
Copyright C-DAC 2004, October 5-9, 2004
HPC Workshop 2004: Optimisation Techniques
High Performance Computing Workshop
Day 1 : October 5, 2004
Uni-Processor Optimization: Code Restructuring and Loop Optimization Techniques
Lecture Outline

- Why you need to optimize sequential code
- The memory hierarchy and how a code's performance depends on it
- Optimization techniques
- Loop optimization techniques: Collapsing, Fission, Fusion, Unrolling, Interchange, Invariant Code Extraction, De-factorization, overheads of if-while-goto, Neighbor Data Dependency
- Arithmetic optimization
- Compiler optimizations; use of tuned math libraries
- Performance of selective applications and benchmarks
- Conclusions
Improving Single Processor Performance
How much sustained performance can one achieve for a given program on a machine?

It is the programmer's job to take as much advantage as possible of the CPU's hardware/software characteristics to boost the performance of the program!

Quite often, just a few simple changes to one's code improve performance by a factor of 2, 3, or better!

Also, simply compiling with some of the optimization flags (-O3, -fast, ...) can improve performance dramatically!
The Memory Sub-system: Access Time is Important

A lot of time is spent accessing/storing data from/to memory. It is important to keep in mind the relative access times for each memory type.

Approximate access times:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 Cache: 1 cycle (separate Data and Instruction caches); repeated access to the cache takes only 1 cycle
- L2 Cache (static RAM): 3-5 cycles
- Memory (DRAM): 10 cycles on a cache miss; 30-60 cycles for a Translation Lookaside Buffer (TLB) update
- Disk: about 100,000 cycles!
- Connecting to other nodes: depends on network latency

[Figure: CPU registers, I-cache/D-cache, L2, RAM, disk]
Hierarchical Memory

A four-level memory hierarchy for a large computer system:

Level 0: Registers and internal caches in the CPU
Level 1 (M1): External cache (SRAMs)
Level 2 (M2): Main memory (DRAMs)
Level 3 (M3): Disk storage (magnetic)
Level 4 (M4): Tape units (magnetic)

Going down the hierarchy, capacity and access time increase; going up, cost per bit increases.
Loop Optimization Techniques
- Classical optimization techniques (the compiler does these)
- Memory reference optimization (the compiler does this to some extent)
- Loop optimizations (the compiler does these to some extent):
  Loop Fission and Loop Fusion, Loop Interchange, Loop Alignment, Loop Collapsing, Loop Unrolling
Loop Collapsing

It attempts to create one (larger) loop out of two or more small ones. This may be profitable if the size of each of the two loops is too small for efficient vectorization, but the resulting single loop can be profitably vectorized.

Before:
      REAL A(5,5), B(5,5)
      DO 10 J = 1, 5
      DO 10 I = 1, 5
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

After:
      REAL A(25), B(25)
      DO 10 JI = 1, 25
        A(JI) = B(JI) + 2.0
10    CONTINUE

Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.
Loop Collapsing (Contd)

Using this technique, the code may be transformed into a single loop, regardless of the size of M and N. This may require some additional statements to restart the code properly.

Before:
      DO 10 J = 1, N
      DO 10 I = 1, M
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

After:
      DO 10 L = 1, N*M
        J = (L-1)/M + 1
        I = MOD(L-1, M) + 1
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

General versions of this technique are useful for computing systems which support only a single (not nested) DOALL statement.
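The same collapse, including the arithmetic index recovery, can be sketched in C (0-based, row-major); the array sizes and function name here are illustrative, not from the slides:

```c
#include <assert.h>

#define M 4   /* illustrative inner trip count */
#define N 3   /* illustrative outer trip count */

/* Collapsed form of the nested loop: one loop of N*M iterations,
 * recovering the two indices arithmetically from a single counter. */
void add_two_collapsed(double a[N][M], double b[N][M]) {
    for (int l = 0; l < N * M; l++) {
        int j = l / M;   /* outer index, 0..N-1 */
        int i = l % M;   /* inner index, 0..M-1 */
        a[j][i] = b[j][i] + 2.0;
    }
}
```

The division and modulo add work per iteration, which is the trade-off the slide mentions: one loop's overhead instead of two, at the price of extra index arithmetic.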
Loop Collapsing (Contd)
Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.

Assume the declaration a[50][80][4].

Uncollapsed loop:
for(I = 0; I
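A hedged sketch of what the uncollapsed and collapsed loops for this declaration look like (the element value and function names are assumptions, not the slide's code):

```c
#include <assert.h>

/* Uncollapsed: three nested loops over the declared array, paying
 * loop overhead at every level of the nest. */
void fill_uncollapsed(double a[50][80][4], double v) {
    for (int i = 0; i < 50; i++)
        for (int j = 0; j < 80; j++)
            for (int k = 0; k < 4; k++)
                a[i][j][k] = v;
}

/* Collapsed: a single loop over all 50*80*4 elements, relying on the
 * contiguous row-major layout of a C array; one loop's worth of
 * overhead instead of three. */
void fill_collapsed(double a[50][80][4], double v) {
    double *p = &a[0][0][0];
    for (int n = 0; n < 50 * 80 * 4; n++)
        p[n] = v;
}
```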
Loop Fission and Loop Fusion

Loop Fusion: transforms two adjacent loops into one, on the basis of information obtained from data-dependence analysis. Two statements will be placed into the same loop if there is at least one variable or array which is referred to by both.

Loop Fission: attempts to break a single loop into several loops in order to optimize data transfer (behavior of main memory, cache and registers). The primary objective of this optimization is data transfer.

Remark: Loop fission and loop fusion are techniques related to strip mining and loop collapsing.
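The slides show code for fusion but not for fission; a minimal sketch of fission (array names and trip count are assumptions) is:

```c
#include <assert.h>

#define N 1000   /* illustrative trip count */

/* Before fission: one loop streams through four arrays at once. */
void update_combined(double *a, double *b, double *c, double *d) {
    for (int i = 0; i < N; i++) {
        a[i] = a[i] + b[i];
        c[i] = c[i] + d[i];
    }
}

/* After fission: each loop streams through only two arrays, so fewer
 * memory streams compete for the cache at any one time. */
void update_fissioned(double *a, double *b, double *c, double *d) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
    for (int i = 0; i < N; i++)
        c[i] = c[i] + d[i];
}
```

Whether fission or fusion wins depends on how many arrays the loop touches relative to the cache, which is exactly the trade-off discussed under fusion's disadvantages below.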
Loop Fusion (Contd)

It is the merging of several loops into a single loop.

Example (untuned):
for(i=0; i < 100000; i++)
    x = x * a[i] + b[i];
for(i=0; i < 100000; i++)
    y = y * a[i] + c[i];

Example (tuned):
for(i=0; i < 100000; i++) {
    x = x * a[i] + b[i];
    y = y * a[i] + c[i];
}

The tuned code runs at least 10 times faster on UltraSPARC (both compiled with the -O3 flag).
Loop Fusion (Contd)

Advantages:
- The loop overhead is reduced by a factor of two in the above case.
- Allows for better instruction overlap in loops with dependencies.
- Cache misses can be decreased if both loops reference the same array.

Disadvantages:
- Has the potential to increase cache misses if the fused loops contain references to more than four arrays and the starting elements of those arrays map to the same cache line.
  e.g.: x = x * a[i] + b[i] * c[i] + d[i] / e[i]
Loop Optimizations : Basic Loop Unrolling
Loop unrolling is performing multiple loop iterations per pass. It is one of the most important optimizations that can be done on a pipelined machine: it helps performance because it fattens up a loop with calculations that can be done in parallel.

Remark: Never unroll an inner loop.
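A minimal sketch of unrolling by four (array names and trip count are illustrative; the cleanup loop plays the role of the preconditioning loop discussed later):

```c
#include <assert.h>

#define N 1003   /* deliberately not a multiple of 4 */

/* Rolled loop: one multiply and one branch per iteration. */
void mult_rolled(double *a, double *b, double *c) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] * c[i];
}

/* Unrolled by 4: four independent multiplies per branch, giving the
 * pipeline more parallel work between branches. */
void mult_unrolled(double *a, double *b, double *c) {
    int i;
    for (i = 0; i + 3 < N; i += 4) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }
    for (; i < N; i++)   /* cleanup for leftover iterations */
        a[i] = b[i] * c[i];
}
```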
Outer and Inner Loop Unrolling
Loop nest: loops created within other loops.

Remark: the loop or loops in the center are called the inner loops, and the surrounding loops are called the outer loops.
for (i=0; i
Outer and Inner Loop Unrolling
Reasons for applying outer loop unrolling are:
To expose more computations
To improve memory reference patterns
for(I =0; i
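One possible sketch of outer loop unrolling in C (the sizes, names, and loop body are assumptions):

```c
#include <assert.h>

#define ROWS 8    /* illustrative; ROWS is even, so no cleanup needed */
#define COLS 16

/* Outer loop unrolled by 2: each pass of the inner loop loads b[j]
 * once and uses it for two rows of a, exposing more computation per
 * memory reference (both reasons given above). */
void scale_rows(double a[ROWS][COLS], double b[COLS]) {
    for (int i = 0; i < ROWS; i += 2)
        for (int j = 0; j < COLS; j++) {
            a[i][j]     = a[i][j]     * b[j];
            a[i + 1][j] = a[i + 1][j] * b[j];
        }
}
```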
Loop Unrolling and Sum Reduction

Loop unrolling can be used to reduce data dependency; different variables can be used to eliminate the data dependency.

a=0.0;
for (i=0; i
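The truncated fragment above can be completed along these lines: a sum reduction unrolled with four independent partial sums (names and trip count are assumptions):

```c
#include <assert.h>

#define N 100000

/* Four independent accumulators break the serial dependence on a
 * single sum variable. Note: floating-point addition is not
 * associative, so the result can differ in the last bits from the
 * serial sum. */
double sum4(double *x) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < N; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < N; i++)       /* cleanup for N not divisible by 4 */
        s += x[i];
    return s;
}
```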
Qualifying Candidates for Loop Unrolling
The previous example is an ideal candidate for loop unrolling.
Study categories of loops that are generally not prime candidates for unrolling:
Loops with low trip counts
Fat loops
Loops containing branches
Recursive loops
Vector reductions
Qualifying Candidates for Loop Unrolling

To be effective, loop unrolling requires that there be a fairly large number of iterations in the original loop. When the trip count of a loop is low, the preconditioning loop is doing a proportionally large amount of work.

Loops containing procedure calls

Loops containing subroutine or function calls generally are not good candidates for unrolling:
- First: they often contain a fair number of instructions already, and the function call can cancel many more instructions.
- Second: when the calling routine and the subroutine are compiled separately, it is impossible for the compiler to intermix instructions.
- Last: function call overhead is expensive. Registers have to be saved and argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than the loop overhead.
Qualifying Candidates for Loop Unrolling (Contd)

A loop containing procedure calls is not suitable for unrolling:

      DO 10 I = 1, N
        CALL SHORT(A(I), B(I), C)
10    CONTINUE

      SUBROUTINE SHORT(A, B, C)
      A = A + B + C
      RETURN
      END

Unrolled version with a preconditioning loop:

      II = MOD(N, 4)
      DO 9 I = 1, II
        CALL SHORT(A(I), B(I), C)
9     CONTINUE
      DO 10 I = 1+II, N, 4
        CALL SHORT(A(I), B(I), C)
        CALL SHORT(A(I+1), B(I+1), C)
        CALL SHORT(A(I+2), B(I+2), C)
        CALL SHORT(A(I+3), B(I+3), C)
10    CONTINUE
Qualifying Candidates for Loop Unrolling: Fat Loops

If a particular loop is already fat, then unrolling is not going to help much; the loop overhead is already spread over a fair number of instructions. A good rule of thumb is to look elsewhere for performance when the loop innards exceed three or four statements. Such code indicates that inlining is feasible.
Qualifying Candidates for Loop Unrolling: Recursive Loops (Contd)

Original:
      DO 10 I = 2, N
        A(I) = A(I) + A(I-1) * B
10    CONTINUE

Simply unrolling keeps the dependency; each statement needs the result of the one before it:

      A(I)   = A(I)   + A(I-1) * B
      A(I+1) = A(I+1) + A(I)   * B
      A(I+2) = A(I+2) + A(I+1) * B
      A(I+3) = A(I+3) + A(I+2) * B

The dependency can be reduced by deriving a new set of recursive equations, decreasing the dependencies at the expense of creating more work:

Modified:
      DO 10 I = 2, N, 2
        A(I+1) = A(I+1) + A(I) * B + A(I-1) * B * B
        A(I)   = A(I)   + A(I-1) * B
10    CONTINUE

This is an example of vector recursion. A good compiler can make the rolled-up version go faster by recognizing the dependency as an opportunity to save memory traffic.
Negatives of Loop Unrolling

Loop unrolling always adds some run time to the program. If you unroll a loop and see the performance dip a little, you can assume that either:
- The loop wasn't a good candidate for unrolling in the first place, or
- A secondary effect absorbed your performance increase.

Other possible reasons:
- Unrolling by the wrong factor
- Register spilling
- Instruction cache misses
- Other hardware delays
- Outer loop unrolling
Loop Interchange

Loop interchange is a technique for rearranging a loop nest so that the right stuff is at the center; what the "right stuff" is depends upon what you are trying to accomplish. Use loop interchange to move computations to the center of the loop nest.

It is also good for improving memory access patterns: iterating on the wrong subscript can cause a large stride and hurt your performance. By inverting the loops, so that the iteration variables causing the smaller strides are in the center, you can get a performance win.
Loop Interchange (Contd)

Loop interchange to move computations to the center:

Before:
      PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
      DO 10 I = 1, IDIM
      DO 20 J = 1, JDIM
      DO 30 K = 1, KDIM
        D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
30    CONTINUE
20    CONTINUE
10    CONTINUE

After:
      PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
      DO 10 K = 1, KDIM
      DO 20 J = 1, JDIM
      DO 30 I = 1, IDIM
        D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
30    CONTINUE
20    CONTINUE
10    CONTINUE

Frequently, the interchange of nested loops permits a significant increase in the amount of parallelism. The example is straightforward: it is easy to see that there are no inter-iteration dependencies.
float a[2][40][2000]
for(i=0; i
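For the declaration above, the two loop orders can be sketched as follows (the loop bodies and function names are illustrative): in C, the last subscript varies fastest in memory, so it belongs in the innermost loop.

```c
#include <assert.h>

float a[2][40][2000];   /* declaration from the slide above */

/* Large stride: the innermost loop varies the FIRST subscript, so
 * consecutive accesses are 40*2000 floats apart in memory. */
void init_large_stride(float v) {
    for (int k = 0; k < 2000; k++)
        for (int j = 0; j < 40; j++)
            for (int i = 0; i < 2; i++)
                a[i][j][k] = v;
}

/* Unit stride: the innermost loop varies the LAST subscript, so
 * consecutive accesses are adjacent in memory. */
void init_unit_stride(float v) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 40; j++)
            for (int k = 0; k < 2000; k++)
                a[i][j][k] = v;
}
```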
Statements that do not change within an inner loop can be moved outside of the loop (compiler optimizations can usually detect these).
for(i=0 ; i
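A minimal sketch of the transformation in C (names and bounds are assumptions):

```c
#include <assert.h>

#define N 1000

/* Untuned: the loop-invariant product c*d is written inside the loop
 * (most compilers hoist it automatically at -O2 and above). */
void add_untuned(double *a, double *b, double c, double d) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c * d;
}

/* Tuned: the invariant value is computed once, outside the loop. */
void add_tuned(double *a, double *b, double c, double d) {
    double cd = c * d;   /* hoisted loop-invariant expression */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + cd;
}
```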
Loop de-factorization consists of removing common multiplicative factors outside of inner loops:
for(i=0; i
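A sketch of de-factorization in C (names and trip count are assumptions):

```c
#include <assert.h>

#define N 100

/* Untuned: N multiplications by the common factor c inside the loop. */
double weighted_sum_untuned(double *a, double c) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += c * a[i];
    return s;
}

/* De-factorized: the common multiplicative factor is pulled outside,
 * leaving a single multiplication. (Floating-point results can differ
 * in the last bits, since the order of operations changes.) */
double weighted_sum_tuned(double *a, double c) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return c * s;
}
```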
Loop Optimization: IF, WHILE, and DO Loops

Avoid IF/GOTO loops and WHILE loops. They inhibit compiler optimizations and they introduce unnecessary overheads.

Untuned loop (IFs and GOTOs):
      I = 0
10    I = I + 1
      IF (I .GT. 100000) GOTO 30
      A(I) = A(I) + B(I)*C(I)
      GOTO 10
30    CONTINUE

Tuned loop:
      I = 0
10    I = I + 1
      A(I) = A(I) + B(I)*C(I)
      IF (I .LE. 100000) GOTO 10

Another untuned loop (WHILE loop):
      I = 0
      DO WHILE (I .LT. 100000)
        I = I + 1
        A(I) = A(I) + B(I)*C(I)
      END DO

Tuned loop:
      DO I = 1, 100000
        A(I) = A(I) + B(I)*C(I)
      END DO
Example: data wrap-around, untuned version:
jwrap = ARRAY_SIZE - 1;
for(i=0; i
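A hedged reconstruction of the wrap-around idiom in C, following the names in the fragment above (the loop body is an assumption): the untuned version carries the wrap logic through every iteration, while the tuned version peels the wrap-around element out of the loop.

```c
#include <assert.h>

#define ARRAY_SIZE 1000

/* Untuned: jwrap trails i and is updated every iteration just to
 * handle the single wrap-around case at i == 0. */
void diff_untuned(double *b, double *a) {
    int jwrap = ARRAY_SIZE - 1;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        b[i] = a[i] - a[jwrap];
        jwrap = i;   /* previous index, wrapping around at i == 0 */
    }
}

/* Tuned: handle the wrap-around element separately, then run a
 * simple loop over the rest. */
void diff_tuned(double *b, double *a) {
    b[0] = a[0] - a[ARRAY_SIZE - 1];
    for (int i = 1; i < ARRAY_SIZE; i++)
        b[i] = a[i] - a[i - 1];
}
```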
Programming Techniques: Managing the Cache

Original code:
      DO 10, J = 1, N
      DO 10, I = 1, N
      DO 10, K = 1, N
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
10    CONTINUE

We can modify the code to better use the cache.

Modified code:
      DO 10, JB = 1, N, NB
      DO 10, IB = 1, N, NB
      DO 10, KB = 1, N, NB
      DO 10, J = JB, JB + NB - 1
      DO 10, I = IB, IB + NB - 1
      DO 10, K = KB, KB + NB - 1
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
10    CONTINUE

This is most useful as a simple example of cache blocking. Most compilers will automatically cache-block the original code as part of ordinary optimization.
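The same blocking can be sketched in C (sizes and names are illustrative; NB is chosen to divide N so the sketch needs no cleanup loops):

```c
#include <assert.h>

#define N 32
#define NB 8   /* block size; NB must divide N in this simple sketch */

/* Cache-blocked matrix multiply: work proceeds in NB x NB tiles so
 * the tiles of a, b, and c being combined stay cache-resident.
 * c must be zero-initialized by the caller. */
void matmul_blocked(double c[N][N], double a[N][N], double b[N][N]) {
    for (int jb = 0; jb < N; jb += NB)
        for (int ib = 0; ib < N; ib += NB)
            for (int kb = 0; kb < N; kb += NB)
                for (int j = jb; j < jb + NB; j++)
                    for (int i = ib; i < ib + NB; i++)
                        for (int k = kb; k < kb + NB; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

The payoff grows with N: once a full row or column no longer fits in cache, the unblocked loop re-fetches data from memory on every pass, while the blocked loop reuses each tile before moving on.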
Loop Optimizations: Advantages

Loop optimizations accomplish three things:
- Reduce loop overhead
- Increase parallelism
- Improve memory access patterns

Understanding your tools and how they work is critical for using them with peak effectiveness. For performance, a compiler is your best friend.
Recap of Arithmetic Optimization

- Replace frequent divisions by inverse multiplications.
- Multiplications/divisions by integer powers of 2 can be replaced by bit shifts to the left/right (compilers can usually do this).
- Small integer exponentials such as a**4 should be replaced by repeated multiplications a*a*a*a (compilers will usually do this).
- Reorganize (or eliminate) repeated (or useless) operations.
- Use Horner's rule to evaluate polynomials.

Example:
Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F can be written as
((((Ax + B)*x + C)*x + D)*x + E)*x + F

This saves more time in C (speed increases by a factor greater than 10) than in Fortran (improvement of only about 30%) due to the way the C language handles (poorly) the function pow(x,5).
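Both evaluation orders can be written out in C (the function names are illustrative); Horner's form needs only 5 multiplications and 5 additions and no calls to pow():

```c
#include <assert.h>

/* Term-by-term evaluation of Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F. */
double poly_naive(double x, double A, double B, double C,
                  double D, double E, double F) {
    return A*x*x*x*x*x + B*x*x*x*x + C*x*x*x + D*x*x + E*x + F;
}

/* Horner's rule: the same polynomial as a chain of multiply-adds. */
double poly_horner(double x, double A, double B, double C,
                   double D, double E, double F) {
    return ((((A*x + B)*x + C)*x + D)*x + E)*x + F;
}
```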
Compiler Optimizations
Compiler optimization (from Wikipedia, the free encyclopedia):

- Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the executables output by a compiler.
- It allows programmers to write source code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient execution.
- It may or may not result in executables that are perfectly "optimal" by any measure.
Ref: http://en.wikipedia.org/wiki/Compiler_optimization
Sun Workshop Compiler 6.2

-O : Set the optimization level
-fast : Select a set of flags likely to improve speed
-stackvar : Put local variables on the stack
-xlibmopt : Link optimized libraries
-xarch : Specify the instruction set architecture
-xchip : Specify the target processor for use by the optimizer
-native : Compile for best performance on the local host
-xprofile : Collect data for a profile, or use a profile to optimize
-fns : Turn on the SPARC nonstandard floating-point mode
-xunroll n : Unroll loops n times
Basic Compiler Techniques: Optimizations

-O : Optimize at the level most likely to give close to the maximum performance for many realistic applications (currently -O3).
-O1 : Do only the basic local optimizations (peephole).
-O2 : Do basic local and global optimization. This level usually gives minimum code size.
-O3 : Adds global optimizations at the function level. In general, this level, and -O4, usually result in the minimum code size when used with the -xspace option.
-O4 : Adds automatic inlining of functions in the same file. -g suppresses automatic inlining.
-O5 : Does the highest level of optimization, suitable only for the small fraction of a program that uses the largest fraction of compute time. Uses optimization algorithms that take more compilation time or that do not have as high a certainty of improving execution time. Optimization at this level is more likely to improve performance if it is done with profile feedback. See -xprofile=collect|use.
Basic Compiler Techniques: Local Variables on the Stack

-stackvar
Tells the compiler to put most variables on the stack rather than statically allocate them. -stackvar is almost always a good idea, and it is crucial when parallelizing. You can control stack versus static allocation for each variable: variables that appear in DATA, COMMON, SAVE, or EQUIVALENCE statements will be static regardless of whether you specify -stackvar.
Basic Compiler Techniques

-xchip
Specifies the target chip. Specifying the chip lets the compiler know certain implementation details, such as specific instruction timings and the number of functional units.

-xarch
Specifies the target architecture. A target architecture includes the instruction set but may not include implementation details such as instruction timing. -xarch=v8plus on Sun produces an executable file that will take full advantage of some UltraSPARC features.

-native
Directs the compiler to produce the best executable (performance) that it can for the system on which the program is being compiled.
Basic Compiler Techniques

-fast
Runs the program with a reasonable level of optimization; it strikes a balance between speed, portability, and safety. -fast is often a good way to get a first-cut approximation of how fast your program can run with a reasonable level of optimization, but it should not be used to build production code: the meaning of -fast will often change from one release to another, and, as with -native, -fast may change its meaning on different machines.
Basic Compiler Techniques

-fsimple (simple floating-point model)
Tells the compiler to use a simplified floating-point model that includes only ordinary numbers.

-xvector
Vectorization enables the compiler to transform vectorizable loops from scalar to vector form. It is generally faster for long vectors but slower for short ones.

-xlibmil
Tells the compiler to inline certain mathematical operations such as floor, ceiling, and complex absolute value.

-xlibmopt
Tells the linker to use an optimized math library. This may produce slightly different answers than the regular math library; these libraries may get their speed by sacrificing accuracy.
Advanced Compiler Techniques

-xcrossfile
Enables the compiler to optimize and inline source code across different files. It may compile code to be optimal for the files that are compiled together, producing a very fast executable.

-xpad
Directs the compiler to insert padding (unused space) between adjacent variables in common blocks and local variables, to try to improve cache performance.
Using Your Compiler Effectively - Classical Optimizations
The compiler performs the classical optimizations, plus a number of architecture-specific optimizations:
Copy propagation
Constant Folding
Dead Code removal
Strength reduction
Induction Variable Elimination
Common Sub-expression Elimination
Classical Optimizations (Contd)

- Loop invariant code motion
- Induction variable simplification
- Register variable detection
- Inlining
- Loop Fusion
- Loop Unrolling
Classical Optimizations

Copy Propagation
Copy propagation is an optimization that occurs both locally and globally. Given

      x = y
      z = 1.0 + x

the compiler may be able to perform copy propagation across the flow graph:

      x = y
      z = 1.0 + y

Constant Folding
A clever compiler can find constants throughout your program:

      PROGRAM MAIN
      INTEGER I, K
      PARAMETER (I=200)
      K = 200
      J = I + K
      END
Classical Optimizations (Contd)

Dead Code Removal
Dead code comes in two types:
- Instructions that are unreachable.
- Instructions that produce results which are never used.

      PROGRAM MAIN
      I = 2
      WRITE (*,*) I
      STOP
      I = 4
      WRITE (*,*) I
      END

Strength Reduction
Operations or expressions have various time costs associated with them. There are many opportunities for compiler-generated strength reductions:

      Y = X**2    becomes    Y = X*X
      J = K*2     becomes    J = K+K
Classical Optimizations (Contd)

Variable Renaming
Example: observe the variable x in the following fragment of code:

      x = y * z
      q = r + x + x
      x = a + b

Renamed:

      xx = y * z
      q = r + xx + xx
      x = a + b

Variable renaming is an important technique because it clarifies that calculations are independent of each other, which increases the number of things that can be done in parallel.

Common Subexpression Elimination

      D = C * (A + B)
      E = (A + B)/2

becomes

      temp = A + B
      D = C * temp
      E = temp/2

Different compilers go to different lengths to find common subexpressions.
Classical Optimizations (Contd)

Loop Invariant Code Motion
The compiler will look for every opportunity to move calculations out of a loop and into the surrounding code. Loop invariant code motion is simply the act of moving the repeated, unchanging calculations to the outside:

      DO 10 I=1,N
        A(I) = B(I) + C*D
        E = G(K)
10    CONTINUE

becomes

      temp = C*D
      DO 10 I=1,N
        A(I) = B(I) + temp
10    CONTINUE
      E = G(K)

Induction Variable Simplification
Loops can contain what are called induction variables:

      DO 10 I=1,N
        K = I*4 + M
10    CONTINUE

becomes

      K = M
      DO 10 I=1,N
        K = K + 4
10    CONTINUE
Classical Optimizations (Contd): Associative Transformations and Reductions

Example: dot product of two vectors:

      SUM = 0.0
      DO 10 I=1, N
        SUM = SUM + A(I)*B(I)
10    CONTINUE

The loop is recursive on the single variable SUM; every iteration needs the result of the previous iteration. Since the assignment is being made to a scalar, unrolling isn't as straightforward as before. The obvious way is to calculate several iterations at a time:

      SUM = 0.0
      DO 10 I=1, N, 4
        SUM = SUM + A(I)*B(I) + A(I+1)*B(I+1) + A(I+2)*B(I+2) + A(I+3)*B(I+3)
10    CONTINUE
Classical Optimizations (Contd)

Dependency analysis is a technique whereby the syntactic constructs of a program are analyzed with the aim of determining whether certain values may depend on other previously computed values. The real objective of dependence analysis is to determine whether two statements are independent of each other.

Example:
      S1: A = C - A
      S2: A = B + C
      S3: B = A + C

DOALL transformations: this transformation converts every iteration of a loop into a process that is independent of all others. It assumes that there are no loop-carried dependencies. The DOALL transformation is very efficient if it can be applied; however, many loops carry dependencies.
Classical Optimizations (Contd)

Register Variable Detection
On many CISC processors there were few general-purpose registers. On RISC designs there are many more registers to choose from, and everything has to be brought into a register anyway, so most variables will be register resident at some point. The new challenge is to determine which variables should live the greater portion of their lives in registers.
Classical Optimizations (Contd)

Inlining
Inlining is the substitution of the body of a subprogram for the call of that subprogram. This eliminates function call overhead. To enable inlining by the Sun compilers, use -fast or -xO4:
      f77 -fast a.f

Loop Fusion
Loop fusion is the process of fusing two adjacent loops with the same loop bounds, which is usually a good thing.

Induction Values
Induction values are values that can be computed as a function of the loop count variable and possibly other values.
Parallel Programming: Compilation Switches
Automatic and directive-based parallelization

-xautopar, -xexplicitpar, and -xparallel tell the compiler to parallelize your program:
-xautopar : tells the compiler to do only those parallelizations that it can do automatically.
-xexplicitpar : tells the compiler to do only those parallelizations that you have directed it to do with pragmas in the source.
-xparallel : tells the compiler to parallelize both automatically and under pragma control.
-xreduction : tells the compiler that it may parallelize reduction loops. A reduction loop is a loop that produces output with smaller dimension than the input.
Parallel Programming: Compiler Switches

Remarks:
- In some cases, parallelizing a reduction loop can give different answers depending on the number of processors on which the loop is run.
- Compiler directives can usually overcome artificial barriers to parallelization.
- Compiler directives can also overcome legitimate barriers to parallelization, which introduces errors.
- The efficiency and effectiveness of automatic compiler parallelization can be significantly improved by supplying the right switches.
Use of Math Libraries

BLAS, IMSL, NAG, LINPACK, LAPACK, ScaLAPACK, etc.
- Calls to these math libraries can often simplify coding.
- They are portable across different platforms.
- They are usually fine-tuned to the specific hardware as well as to the sizes of the array variables that are sent to them.

Example: Sun performance libraries (-xlic_lib=sunperf), IBM ESSL, ESSLSMP
Performance of Selective Application: CFD

Optimization of unsteady-state 3D compressible Navier-Stokes equations by the finite difference method.

Computing system used: Sun UltraSPARC workstation (each node is a quad-CPU Ultra Enterprise 450 server, operating at 300 MHz)

Grid Size   Iterations  Options                                        Time (seconds)
192*16*16   1000        No compiler options                            4930
192*16*16   1000        With compiler optimization                     2620
192*16*16   1000        Code restructuring and compiler optimization   680

Conclusions: restructuring the code and use of proper compiler optimizations reduces the execution time by a factor of about 8.0.
4-way SMP
- POWER4, 1.0 GHz
- 8 GB main memory (16 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1,2,3, BLACS, ESSL

32-way SMP
- POWER4, 1.1 GHz
- 64 GB main memory (256 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1,2,3, BLACS, ESSL
LLCBench: Performance on IBM p630
LLCBench: Performance on IBM p690
Conclusions

- Reducing memory overheads is important for the performance of sequential and parallel programs.
- Minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size.
- Data reuse in the memory sub-system will increase performance.
- Basic and advanced compiler optimization flags can be used for performance.
- Write code so that a compiler finds it easy to locate optimizations.
- The compiler performs classical optimization techniques and some loop optimization techniques.
References

1. Ernst L. Leiss, Parallel and Vector Computing: A Practical Introduction, McGraw-Hill Series on Computer Engineering, New York (1995).
2. Albert Y. H. Zomaya, Parallel and Distributed Computing Handbook, McGraw-Hill Series on Computer Engineering, New York (1996).
3. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, Redwood City, CA (1994).
4. William Gropp, Rusty Lusk, Tuning MPI Applications for Peak Performance, Pittsburgh (1996).
5. Ian T. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley Publishing Company (1995).
6. Kai Hwang, Zhiwei Xu, Scalable Parallel Computing (Technology, Architecture, Programming), McGraw-Hill, New York (1997).
7. David E. Culler, Jaswinder Pal Singh, with Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, Inc. (1999).