Parallel Computing Explained
Porting Issues

Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
  3.1 Recompile
  3.2 Word Length
  3.3 Compiler Options for Debugging
  3.4 Standards Violations
  3.5 IEEE Arithmetic Differences
  3.6 Math Library Differences
  3.7 Compute Order Related Differences
  3.8 Optimization Level Too High
  3.9 Diagnostic Listings
  3.10 Further Information
Porting Issues
In order to run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer on a new parallel computer, you must first "port" the code.
After porting the code, it is important to have benchmark results you can use for comparison.
  To do this, run the original program on a well-defined dataset, and save the results from the old or "baseline" computer.
  Then run the ported code on the new computer and compare the results.
If the results are different, don't automatically assume that the new results are wrong; they may actually be better. There are several reasons why this might be true, including:
  Precision Differences: the new results may actually be more accurate than the baseline results.
  Code Flaws: porting your code to a new computer may have uncovered a hidden flaw in the code that was already there.
Detection methods for finding code flaws, solutions, and workarounds are provided in this lecture.
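The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of the original course: the function name `compare_results`, the sample values, and the tolerance are assumptions, and in practice the two lists would be read from the saved baseline and ported output files.

```python
import math

def compare_results(baseline, ported, rel_tol=1e-9, abs_tol=0.0):
    """Return the indices where the ported value differs from the baseline
    by more than the given tolerance (hypothetical helper)."""
    if len(baseline) != len(ported):
        raise ValueError("result sets differ in length")
    return [
        i
        for i, (b, p) in enumerate(zip(baseline, ported))
        if not math.isclose(b, p, rel_tol=rel_tol, abs_tol=abs_tol)
    ]

# Made-up sample data standing in for the saved results.
baseline = [1.0, 2.5, 3.141592653589793]
ported = [1.0, 2.5000000000001, 3.1415]

mismatches = compare_results(baseline, ported)
print("indices that differ beyond tolerance:", mismatches)
```

Comparing with a tolerance rather than exact equality separates small precision differences from larger discrepancies that may point to a code flaw. The appropriate tolerance depends on the application.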
Recompile
Some codes just need to be recompiled to get accurate results.
The compilers available on the NCSA computer platforms are shown in the following table:

Language                 | SGI Origin2000           | IA-32 Linux                  | IA-64 Linux
                         | MIPSpro | Portland Group | Intel | GNU | Portland Group | Intel | GNU
Fortran 77               | f77     |                | ifort | g77 | pgf77          | ifort | g77
Fortran 90               | f90     |                | ifort |     | pgf90          | ifort |
Fortran 95               | f95     |                | ifort |     |                | ifort |
High Performance Fortran |         | pghpf          |       |     | pghpf          |       |
C                        | cc      |                | icc   | gcc | pgcc           | icc   | gcc
C++                      | CC      |                | icpc  | g++ | pgCC           | icpc  | g++
Word Length
Code flaws can occur when you are porting your code to a computer with a different word length.
For C, the size of an integer variable differs depending on the machine and how the code is compiled.
  On the IA32 and IA64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively.
  On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
For Fortran, the SGI MIPSpro and Intel compilers provide the following flags to set the default variable size:
  -in where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
  -rn where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
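One quick way to see which C type sizes a given machine uses is to query them directly. The sketch below is not from the original slides and assumes a Python interpreter is available on the target system. It uses the standard `ctypes` module, which reports the sizes used by the C compiler that built the interpreter. On most 64-bit Linux systems it is `long` and pointer types, rather than plain `int`, that grow to 8 bytes, so it is worth checking each type separately.

```python
import ctypes

def c_type_sizes():
    """Return the size in bytes of a few C types on this platform."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "void *": ctypes.sizeof(ctypes.c_void_p),
        "double": ctypes.sizeof(ctypes.c_double),
    }

sizes = c_type_sizes()
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Running this on both the baseline and the new computer shows at a glance whether code that assumes a particular integer or pointer width needs attention.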
Compiler Options for Debugging
On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG: group. The syntax is as follows:
  -DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
  Array-bound checking: check for subscripts out of range at runtime.
    -DEBUG:subscript_check=ON
  Force all uninitialized stack, automatic, and dynamically allocated variables to be initialized.
    -DEBUG:trap_uninitialized=ON
Compiler Options for Debugging
On the IA32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
  -CA: pointers and allocatable references
  -CB: array and subscript bounds
  -CS: consistent shape of intrinsic procedure
  -CU: use of uninitialized variables
  -CV: correspondence between dummy and actual arguments
Standards Violations
Code flaws can occur when the program has non-ANSI standard Fortran coding.
ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop.
Standards Violations Detection
To detect standards violations on the SGI Origin2000 computer, use the -ansi flag.
  This option generates a listing of warning messages for the use of non-ANSI standard coding.
On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.
IEEE Arithmetic Differences
Code flaws can occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not.
The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior.
  For example, it prohibits the compiler writer from replacing x/y with x * recip(y), since the two results may differ slightly for some operands.
You can make your program strictly conform to the IEEE standard.
To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer, use:
  f90 -OPT:IEEE_arithmetic=n ... prog.f
  where n is 1, 2, or 3.
  This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or at a slightly relaxed level with the -mp1 flag.
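The x/y versus x * recip(y) remark can be checked directly. The following Python sketch is an illustration added here, not part of the original course material. Python floats are IEEE double-precision values, so the search below lists small integer operand pairs for which direct division and multiplication by the reciprocal round differently. The function name and the search range are assumptions chosen for the example.

```python
def reciprocal_mismatches(limit=100):
    """Return (x, y) integer pairs below `limit` for which
    x / y and x * (1.0 / y) give different double-precision results."""
    mismatches = []
    for x in range(1, limit):
        for y in range(1, limit):
            direct = x / y
            via_reciprocal = x * (1.0 / y)
            if direct != via_reciprocal:
                mismatches.append((x, y))
    return mismatches

pairs = reciprocal_mismatches()
print(f"{len(pairs)} of {99 * 99} operand pairs differ")

# Show the first mismatch in full precision.
x, y = pairs[0]
print(x, y, repr(x / y), repr(x * (1.0 / y)))
```

The differences are confined to the last bit or so of each result, which is why a compiler that substitutes the reciprocal form can change answers slightly without any error in the source code.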
Math Library Differences
Most high-performance parallel computers are equipped with vendor-supplied math libraries.
On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
  SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines.
  SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
  The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.
Math Library Differences
On the IA32 Linux cluster, the libraries to link to are:
  For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
  For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
  When calling MKL routines from C/C++ programs, you also need to link with -lF90.
On the IA64 Linux cluster, the corresponding libraries are:
  For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
  For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
  When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins
Compute Order Related Differences
Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The order in which the threads will run cannot be guaranteed.
  For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
  Note: If your algorithm depends on data being computed in a specific order, your code is inappropriate for a parallel computer.
Use the following method to detect compute order related differences:
  If your loop looks like
    DO I = 1, N
  change it to
    DO I = N, 1, -1
  The results should not change if the iterations are independent.
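The loop-reversal check can be illustrated with a small Python sketch. It is an added example rather than code from the course, and the two loop bodies (`scale_loop` and `recurrence_loop`) are hypothetical stand-ins: the first has independent iterations, while the second reads the element written by the previous iteration.

```python
def scale_loop(b, order):
    """Independent iterations: a[i] depends only on b[i]."""
    a = [0.0] * len(b)
    for i in order:
        a[i] = 2.0 * b[i]
    return a

def recurrence_loop(b, order):
    """Dependent iterations: a[i] uses a[i-1] from an earlier iteration."""
    a = [0.0] * len(b)
    for i in order:
        previous = a[i - 1] if i > 0 else 0.0
        a[i] = previous + b[i]
    return a

b = [1.0, 2.0, 3.0, 4.0]
forward = range(len(b))                # DO I = 1, N
backward = range(len(b) - 1, -1, -1)   # DO I = N, 1, -1

print("independent loop unchanged:",
      scale_loop(b, forward) == scale_loop(b, backward))
print("recurrence loop unchanged:",
      recurrence_loop(b, forward) == recurrence_loop(b, backward))
```

The independent loop gives the same array in either direction, while the recurrence does not, which flags it as unsafe to run in an arbitrary order.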
Optimization Level Too High
Code flaws can occur when the optimization level has been set too high, thus trading accuracy for speed.
The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at a higher optimization level.
Setting the Optimization Level
Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2, or 3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking the correctness and precision of the calculation is highly recommended when -O3 is used.
For example, on the Origin2000,
  f90 -O0 … prog.f
turns off all optimizations.
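One reason reordering can change answers is that floating-point addition is not associative. The short Python sketch below is an added illustration, not taken from the slides. It evaluates the same three-term sum with two groupings, which is the kind of change an optimizer might make when it reorders arithmetic. The values are arbitrary sample data.

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # order written in the source
reassociated = a + (b + c)    # order an optimizer might choose

print("same result:", left_to_right == reassociated)
print("difference:", abs(left_to_right - reassociated))
```

The two results differ only in the last bit, but in a long-running calculation such differences can accumulate, which is why results at -O3 are worth comparing against a lower level.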
Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems using the method of binary chop.
To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f.
Compile the first half with -O0 and the second half with -O3:
  f90 -c -O0 prog1.f
  f90 -c -O3 prog2.f
  f90 prog1.o prog2.o
  a.out > results
If the results are correct, the optimization problem lies in prog1.f.
Next divide prog1.f into halves. Name them prog1a.f and prog1b.f.
Compile prog1a.f with -O0 and prog1b.f with -O3:
  f90 -c -O0 prog1a.f
  f90 -c -O3 prog1b.f
  f90 prog1a.o prog1b.o prog2.o
  a.out > results
Continue in this manner until you have isolated the section of code that is producing incorrect results.
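The binary chop procedure is a bisection search over source units, sketched below in Python. This is an added illustration under stated assumptions, not a tool from the course. The callback `results_correct` is hypothetical: in real use it would rebuild the program with the given units at -O0 and the rest at -O3, run it, and compare against the baseline. Here it is simulated, and the sketch assumes exactly one unit is sensitive to optimization.

```python
def binary_chop(units, results_correct):
    """Return the single unit whose optimization breaks the results.

    units: list of source unit names.
    results_correct(low_opt): hypothetical callback that returns True when
    the program is correct with the units in `low_opt` compiled at -O0
    and all other units compiled at -O3.
    """
    suspects = list(units)
    while len(suspects) > 1:
        first_half = suspects[: len(suspects) // 2]
        second_half = suspects[len(suspects) // 2 :]
        if results_correct(set(first_half)):
            # Lowering optimization on this half fixed the results,
            # so the problem lies in this half.
            suspects = first_half
        else:
            suspects = second_half
    return suspects[0]

# Simulated build-and-run: pretend sub5.f is the unit that misbehaves at -O3.
units = [f"sub{i}.f" for i in range(1, 9)]
culprit = "sub5.f"

def simulated_check(low_opt):
    return culprit in low_opt

print(binary_chop(units, simulated_check))
```

Each step halves the number of suspect units, so isolating one unit among N takes about log2(N) rebuilds.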
Diagnostic Listings
The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:
  f90 -listing ...
  f90 -fullwarn ...
  f90 -showdefaults ...
  f90 -version ...
  f90 -help ...
Further Information
SGI
  man f77/f90/cc
  man debug_group
  man math
  man complib.sgimath
  MIPSpro 64-Bit Porting and Transition Guide
  Online Manuals
Linux clusters pages
  ifort/icc/icpc -help (IA32, IA64, Intel64)
  Intel Fortran Compiler for Linux
  Intel C/C++ Compiler for Linux
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
  4.3 Vendor Tuned Code
  4.4 Further Information
Scalar Tuning
If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime.
This chapter describes many of these techniques:
  The use of the most aggressive compiler options
  The improvement of loop unrolling
  The use of subroutine inlining
  The use of vendor-supplied tuned code
The detection of cache problems and their solution are presented in the Cache Tuning chapter.
Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
  -O0 turns off all optimizations.
  -O1 and -O2 do beneficial optimizations that will not affect the accuracy of results.
  -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.
Aggressive Compiler Options
It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes.
  It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n = 1, 2, or 3) and -mp (or -mp1), respectively, to enforce conformance to the IEEE standard at different levels.
On the SGI Origin2000, the option
  -Ofast=ip27
is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.