
Page 1: IBM XL Compilers Performance Tuning 2016-11-18

© 2016 IBM Corporation

Performance Tuning with IBM XL C/C++/Fortran Compilers and Libraries

Yaoqing Gao ([email protected])
Senior Technical Staff Member

IBM Canada Lab

November 2016

Page 2: IBM XL Compilers Performance Tuning 2016-11-18

@IBM_compilers1 IBM C/C++ and Fortran Compilers and Libraries

Agenda

§ Overview of IBM XL C/C++ and Fortran Compilers
§ Performance Analysis
  • Performance tools
  • Hot spots and bottleneck detection
§ Performance Tuning
§ Summary

Page 3: IBM XL Compilers Performance Tuning 2016-11-18


Overview of IBM XL C/C++ and Fortran Compilers and Libraries

Page 4: IBM XL Compilers Performance Tuning 2016-11-18

IBM XL C/C++ and Fortran Compilers

Easy migration
• C/C++ language standard conformance; Fortran 2003 compliance, selected Fortran 2008 features; OpenMP 3.1 compliance, selected OpenMP 4.0/4.5 features; CUDA C/C++ and Fortran
• Full binary compatibility with GCC
• Option and source compatibility with GCC and Clang

Industry-leading performance
• Full enablement and exploitation of the latest Power hardware
• Leading-edge advanced compiler optimization technologies
• Optimized math libraries
• 10-30% better than open-source compilers for typical workloads

Agility
• Flexibility and speed in delivery schedules
• Superior service and support
• Successful customer engagements

Page 5: IBM XL Compilers Performance Tuning 2016-11-18

Advanced Optimization Technology

• Full platform exploitation
  – Enable and exploit POWER hardware features
• Loop transformation
  – Analyze and transform loops to improve performance
• Automatic SIMDization/vectorization
  – Convert operations to allow several calculations to occur simultaneously
• Parallelization
  – Automatic parallelization and explicit parallelization through OpenMP
• Optimized math libraries
  – Scalar MASS library and vector MASSV library tuned for POWER
• IPA (inter-procedural analysis)
  – Apply optimization techniques to entire programs
• PDF (profile-directed feedback)
  – Tune application performance for typical usage scenarios

Page 6: IBM XL Compilers Performance Tuning 2016-11-18

Open Source & GCC Affinity

[Diagram: an existing GCC/Clang build (Makefile, C/C++ sources) migrates to XLC; both compilers produce C/C++ objects/binaries that the linker combines into one program]

§ Options and source compatibility
§ Binary compatibility
§ Clang adoption

Page 7: IBM XL Compilers Performance Tuning 2016-11-18

Compiling Applications with XLC and XLF

§ Check the compiler release and version
  xlc -qversion
  xlC -qversion
  xlf -qversion
§ Compile an application
  xlc for C code; xlC for C++ code
  xlf, xlf90, xlf95, xlf2003, xlf2008 for Fortran code
§ Specify compile options
  • -O3 or -O3 -qhot for floating-point computation-intensive applications
  • -O3 or -O3 -qipa for integer applications
  • -qsmp=omp for OpenMP applications

Example:
  xlc -qversion
  IBM XL C/C++ for Linux, V13.1.4 Version: 13.01.0004.0000

Page 8: IBM XL Compilers Performance Tuning 2016-11-18

CUDA C/C++

§ NVCC can use XLC as the host compiler for the POWER CPU
  • NVCC is the NVIDIA CUDA C++ compiler from the NVIDIA CUDA Toolkit
  • NVCC partitions C/C++ source code into CPU and GPU portions
§ Detailed instructions for using XLC
  • Red Book: http://www.redbooks.ibm.com/redpapers/pdfs/redp5169.pdf
§ Invocation example
  • nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -Xcompiler -qsmp=omp -gencode arch=compute_20,code=sm_20 -o cudaOpenMP.o -c cudaOpenMP.cu
  • nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -o cudaOpenMP cudaOpenMP.o -lxlsmp

Page 9: IBM XL Compilers Performance Tuning 2016-11-18

CUDA Fortran

§ What's CUDA Fortran
  • Created by PGI and NVIDIA in 2009-2010
  • Functionally equivalent to CUDA C
  • Provides seamless integration of CUDA into Fortran declarations and statements
  • The CUDA runtime API is also available from CUDA Fortran
  • Fortran modules provide bind(c) interfaces for CUDA C libraries
§ XL Fortran V15.1.4 support for CUDA Fortran
  • Supports a commonly used subset of CUDA Fortran features
  • Programs benefit from industry-leading POWER CPU optimization while exploiting GPU performance
§ Invocation example
  • xlcuf -O3 demo.cuf -o demo
  • ./demo

Page 10: IBM XL Compilers Performance Tuning 2016-11-18

OpenMP Support for C/C++ and Fortran

§ What's OpenMP
  • A de-facto industry standard for parallel programming for over 15 years
  • Supports the Fortran and C/C++ programming languages, with both shared-memory and accelerator programming models
§ OpenMP accelerator support (new in OpenMP 4.0)
  • The host device offloads target regions to the target devices
  • The target devices can be GPUs, DSPs, coprocessors, etc.
  • Insert directives to offload code blocks to a target device
§ XL C/C++ and XL Fortran support for OpenMP 4.0 & 4.5
  • Commonly used subset: beta in June and GA at the end of 2016
  • Incremental GA deliveries to complete the support in future releases
§ Invocation examples
  • For CPU only
    – xlC_r -O3 -qsmp=omp saxpy.cpp ; ./a.out
    – xlf_r -O3 -qsmp=omp saxpy.f ; ./a.out
  • For GPU offloading
    – xlC_r -O3 -qsmp=omp -qoffload saxpy.cpp ; ./a.out
    – xlf_r -O3 -qsmp=omp -qoffload saxpy.f ; ./a.out

Page 11: IBM XL Compilers Performance Tuning 2016-11-18

Compiler Options Quick Reference Guide – POWER8, OpenPOWER Linux (LE)

Compilers and documentation
  • XL (xlc, xlC, xlf): http://ibm.biz/xlcpp-linux , http://ibm.biz/xlfortran-linux
  • GNU (gcc, g++, gfortran): http://gcc.gnu.org
  • Clang: http://clang.llvm.org

Architecture – generate instructions that run on POWER8
  • XL: -mcpu=power8 or -qarch=pwr8 (default)
  • GNU: -mcpu=power8
  • Clang: -target powerpcle-unknown-linux-gnu -mcpu=pwr8

Disable all optimizations
  • XL: -O0 -qnoopt (default)
  • GNU: -O0 (default)
  • Clang: -O0 (default)

Optimization levels
  • XL: -O, -O2, -O3, -O4, -O5
  • GNU: -O or -O1, -O2, -O3, -Ofast
  • Clang: -O0, -O2, -O3, -Os

Recommended optimization (a good balance between runtime performance and compilation time)
  • XL: commercial code: -O3 or -O3 -qipa; technical computing/analytics: -O3 or -O3 -qhot
  • GNU: commercial code: -O3 -mcpu=power8; technical computing/analytics: -O3 -mcpu=power8 -funroll-loops
  • Clang: -O2

Feedback-directed optimization
  • XL: -qpdf1, -qpdf2
  • GNU: -fprofile-generate, -fprofile-use
  • Clang: -fprofile-instr-generate, -fprofile-instr-use

Interprocedural optimizations
  • XL: -qipa
  • GNU: -flto
  • Clang: -flto

OpenMP
  • XL: -qsmp=omp
  • GNU: -fopenmp
  • Clang: -fopenmp

Loop optimizations
  • XL: -qhot
  • GNU: -fpeel-loops, -funroll-loops
  • Clang: -funroll-loops

More info – resources from IBM
  • http://ibm.biz/xlcpp-linux-ce
  • http://ibm.biz/xl-info
  • http://ibm.biz/linuxonpower
  • http://ibm.biz/sdk-linuxonpower

Page 12: IBM XL Compilers Performance Tuning 2016-11-18


Identify Application Hot Spots and Performance Bottlenecks

Page 13: IBM XL Compilers Performance Tuning 2016-11-18

Hot Spot and Bottleneck Detection

§ Identify hot spots and detect bottlenecks
  • Gather profile information: timing, call frequency, block frequency, frequently used values
  • Performance tools: gprof, oprofile, perf, PIF, IBM SDK, compiler instrumentation
§ Identify whether a workload is computation intensive, memory latency or bandwidth intensive, or I/O intensive by gathering performance-counter information about
  • CPI breakdown
  • FPU/FXU
  • Cache misses
  • Branch mispredictions
  • LSU, etc.

Page 14: IBM XL Compilers Performance Tuning 2016-11-18

Profiling with gprof

§ gprof is a performance analysis tool that uses a hybrid of instrumentation and sampling
§ Step 1: Compile the application with the XL compiler option -pg
  • Instrumentation code is inserted into the program during compilation to gather caller-function data at run time
  • xlc -O3 -pg -o app app.c
§ Step 2: Run the application
  • Sampling data is saved in the 'gmon.out' or 'progname.gmon' file just before the program exits
§ Step 3: Run the gprof tool
  • gprof app gmon.out > analysis.txt
§ Step 4: Analyze the profiling information
  • For each function: who called it, whom it called, and how many times
  • How many times each function was called and the total time involved, sorted by time consumed

Page 15: IBM XL Compilers Performance Tuning 2016-11-18

Profiling with operf/oprofile

§ OProfile is a system-wide statistical profiling tool; operf is the profiler provided with OProfile
§ Step 1: Select performance events
  • ophelp lists the available events
§ Step 2: Gather the profile information
  • operf -e event1[,event2[,...]], where each event is specified as event_name:sampling_rate[:unitmask[:kernel[:user]]]
§ Step 3: Analyze the profiling information
  • opreport -l

Page 16: IBM XL Compilers Performance Tuning 2016-11-18

Profiling with perf

§ perf is a performance tool that automatically groups events and cycles through them every N μsecs
§ Step 1: Select performance events
  • perf list lists the available events
§ Step 2: Gather the profile information
  • perf <command>, where <command> = { lock, stat, sched, kmem, timechart, top, etc. }
  • perf record -e event_name|raw_PMU_event, where each event is specified as event_name:sampling_rate[:unitmask[:kernel[:user]]] | mem:addr[:[r][w][x]]
§ Step 3: Analyze the profiling information
  • perf report --source

Page 17: IBM XL Compilers Performance Tuning 2016-11-18


Compiler Optimization: Basic and Advanced Optimization

Page 18: IBM XL Compilers Performance Tuning 2016-11-18

Optimization Capabilities

§ Platform exploitation
  • -qarch: ISA exploitation
  • -qtune: skew performance tuning toward a specific processor, including -qtune=balanced
  • Large library of compiler built-ins and performance annotations
§ Mature compiler optimization technology
  • Five distinct optimization packages
  • Debug support and assembly listings available at all optimization levels
  • Whole-program optimization
  • Profile-directed optimization

Page 19: IBM XL Compilers Performance Tuning 2016-11-18

Basic Compilation

§ Used at lower optimization levels (noopt, -O2)
  • Focus on fast compilation
§ More aggressive optimization with limited impact on compilation time: -O3 -qnohot
  • Implies -qnostrict, which may affect program behavior (mainly the precision of floating-point operations)
§ Optionally generates an assembly listing file (source.lst)

[Diagram: source file -> C/C++ or Fortran front end -> xl*code back end -> object file and source.lst]

Page 20: IBM XL Compilers Performance Tuning 2016-11-18

Advanced Compilation

§ Focus on runtime performance, at the expense of compilation time
  • Aggressive loop transformations
  • More precise dataflow analysis
§ Triggered by several compiler flags: -O3, -qhot, -qsmp
§ Multiple levels of aggressiveness for loop transformations
  • -qhot=level=0 (default at -O3)
  • -qhot=level=1 (default at -qhot)
  • -qhot=level=2
§ Can be combined with -qstrict

[Diagram: source file -> C or Fortran front end -> ipa -> xl*code back end -> object file and source.lst]

Page 21: IBM XL Compilers Performance Tuning 2016-11-18

Whole-Program Optimization – Compile Phase

§ Collects a high-level program representation in preparation for link-time whole-program optimization
§ Triggered by -qipa
  • Implied by -O4, -O5, -qpdf1/-qpdf2
  • Identical behavior at all -qipa levels
§ Can be used independently of -qhot
§ Output is a composite object file
  • Includes the regular object file and the intermediate representation
  • Allows linking the object file with or without link-time optimization
  • Skip generation of the regular object using -qipa=noobject

[Diagram: source file -> C/C++ or Fortran front end -> ipa -> extended object file; xl*code also emits the regular object file and source.lst]

Page 22: IBM XL Compilers Performance Tuning 2016-11-18

Whole-Program Optimization – Link Phase

§ Intercepts the system linker and re-optimizes the whole program
  • -qipa=level=0 (default with -qpdf)
  • -qipa=level=1 (default with -qipa)
  • -qipa=level=2
§ Must use the compiler invocation to link the program, with -qipa
  • Do not use ld directly
§ Flexible handling of extended objects
  • Can be placed in archives
  • Accepts a combination of regular and extended object files
§ Whole-program assembly listing
  • Default name a.lst
§ Under -qpdf1/-qpdf2 the compiler collects and uses runtime profile information about the program

[Diagram: extended object files, system library objects, and the profile data file feed ipa -> xl*code -> final object file -> system linker -> executable and a.lst]

Page 23: IBM XL Compilers Performance Tuning 2016-11-18

Summary of Optimization Levels

§ noopt, -O0
  – Quick local optimizations
  – Keep the semantics of the program (-qstrict)
§ -O2
  – Optimizations for the best combination of compile speed and runtime performance
  – Keep the semantics of the program (-qstrict)
§ -O3
  – Equivalent to -O3 -qhot=level=0 -qnostrict for XLC
  – Equivalent to -O3 -qhot -qnostrict for XLF
  – Focus on runtime performance at the expense of compilation time: loop transformations, dataflow analysis
  – May alter the semantics of the program (-qnostrict)
§ -O3 -qhot
  – Equivalent to -O3 -qhot=level=1 -qnostrict
  – Performs aggressive loop transformations and dataflow analysis at the expense of compilation time
§ -O4
  – Equivalent to -O3 -qhot=level=1 -qipa=level=1 -qnostrict
  – Aggressive optimization: whole-program optimization; aggressive dataflow analysis and loop transformations
§ -O5
  – Equivalent to -O3 -qhot=level=1 -qipa=level=2 -qnostrict
  – More aggressive optimization: more aggressive whole-program optimization, more precise dataflow analysis and loop transformations

Page 24: IBM XL Compilers Performance Tuning 2016-11-18

Basic Optimization Techniques

§ Inlining
  • Replaces a call to a procedure with a copy of the procedure itself. This eliminates the overhead of calling the function and allows specialization of the function for the specific call site
§ Redundancy detection
  • Identifies computations that are redundant or partially redundant with previously computed values, so the value can be reused rather than recomputed
§ Platform exploitation
  • Uses a model of the target processor to determine the best mix of instructions to implement a given program sequence
§ Flow restructuring
  • Reorganizes the code to increase the density of hot code or to make conditional branches less frequently taken

Page 25: IBM XL Compilers Performance Tuning 2016-11-18

Loop Optimization

§ Analyze and transform loops to improve runtime performance
  • Analyze memory access patterns to improve cache utilization
  • Tailor the instruction schedule to the specific loop and target processor
  • Interleave execution of multiple loop iterations
§ Most effective on numerical applications, e.g. analytics and technical computing
  • Depends on loops with regular behavior that can be analyzed and restructured by the optimizer
§ Enabled at -O3 and above; aggressive loop optimization with -O3 -qhot

Page 26: IBM XL Compilers Performance Tuning 2016-11-18

SIMDization/Vectorization

§ Supports the INTEGER, UNSIGNED, REAL, and COMPLEX data types
§ Explicit SIMD programming with -qaltivec(=BE|LE)
§ Automatic SIMDization at -O3 -qhot
  • Basic-block-level SIMDization
  • Loop-level aggregation
  • Data conversion
  • Reduction
  • Loops with limited control flow
  • Math SIMDization
  • Partial loop vectorization
  • Alignment handling

Page 27: IBM XL Compilers Performance Tuning 2016-11-18

Successful SIMDizer

[Diagram: categories of loops the SIMDizer handles, with GENERIC, POWER, BG/Q, and CELL targets]

• Loop level: for (i=0; i<n; i++) a[i] = ...
• Math vectorization (multiple targets): A = sqrt(B); C = sqrt(D);
• Isomorphic basic-block level:
    a[i+0] = b[i+0] * c[i+0]
    a[i+1] = b[i+1] * c[i+1]
    a[i+2] = b[i+2] * c[i+2]
    a[i+3] = b[i+3] * c[i+3]
• Non-isomorphic basic-block level:
    a[i+0] = b[i+0] * c[i+0] + d[i+0]
    a[i+1] = b[i+1] * c[i+1] + d[i+0]
    a[i+2] = b[i+2] * c[i+2] - d[i+0]
    a[i+3] = b[i+3] * c[i+3] - d[i+0]
• Data size conversion: loads of a[i] and a[i+4] with INTEGER-to-FLOAT conversion feeding converted adds and stores
• Alignment constraints: vector loads at 16-byte boundaries (vload b[1], vload b[5]) combined with vpermute to form the unaligned vector b1 b2 b3 b4
• Partial loop vectorization:
    for (i=1; i<n; i++) { a[i] = c[i] * d[i]; b[i] = b[i-1]; }

Page 28: IBM XL Compilers Performance Tuning 2016-11-18

MASS Libraries

§ MASS stands for Mathematical Acceleration SubSystem
§ The MASS libraries contain mathematical routines tuned for optimal performance on POWER architectures
  • A general implementation tuned for POWER
  • Specific implementations tuned for specific POWER processors
§ 16x average speedup for POWER8 LE vector MASS vs. libm
§ Users can add explicit calls to the library
§ The XL compilers can automatically insert calls to MASS/MASSV routines at higher optimization levels

Page 29: IBM XL Compilers Performance Tuning 2016-11-18

MASS Contains Over 140 Functions

§ Over 140 functions in all
  • Single/double precision
  • Scalar/SIMD/vector functions
§ Trigonometric functions and inverses
  • cos, sin, cosisin, sincos, tan, acos, asin, atan, atan2
§ Hyperbolic functions and inverses
  • cosh, sinh, tanh, acosh, asinh, atanh
§ Exponential functions
  • exp, exp2, expm1, exp2m1
§ Logarithm functions
  • log, log2, log10, log1p, log21p
§ Roots and reciprocal roots
  • sqrt, cbrt, qdrt, rsqrt, rcbrt, rqdrt
§ Reciprocal and divide
  • rec, div
§ Power
  • pow
§ Rounding, sign copy
  • aint, dint, anint, dnint, rint, copysign
§ Special functions
  • hypot, erf, erfc, lgamma, popcnt4, popcnt8

Page 30: IBM XL Compilers Performance Tuning 2016-11-18

MASS Scalar Library

§ Analogous to the libm math library
§ Produces one math function result per call (exception: sincos)
§ Easy to use in existing code since the names match libm (just link MASS)
§ Calling from C:

#include <math.h> // prototypes for most scalar MASS functions
#include <mass.h> // prototypes for scalar MASS functions not in math.h

double dx, dy;
dy = exp (dx); // compute dy = exponential function of dx

float fx, fy;
fy = expf (fx); // compute fy = exponential function of fx

double dx, dy, dz;
dz = pow (dx, dy); // compute dz = dx to the power dy

double dx, dsin, dcos;
sincos (dx, &dsin, &dcos); // dsin=sin(dx), dcos=cos(dx)

Page 31: IBM XL Compilers Performance Tuning 2016-11-18


MASS Scalar Library – Calling from Fortran

include 'mass.include' ! interfaces for non-intrinsic scalar MASS

real*8 dx, dy

dy = exp (dx) ! compute dy = exponential function of dx

real*4 fx, fy

fy = exp (fx) ! compute fy = exponential function of fx

real*8 dx, dy, dz

dz = dx**dy ! compute dz = dx to the power dy

real*8 dx, dsin, dcos

call sincos (dx, dsin, dcos) ! dsin=sin(dx), dcos=cos(dx)

Page 32: IBM XL Compilers Performance Tuning 2016-11-18

MASS Vector Library

§ Computes the same math function for each of multiple inputs
§ Highest performance, provided the vector length is sufficient
  • Vector length of at least 2 to 10, depending on the function
§ Calling from C:

#include <massv.h> // prototypes for vector MASS functions

#define N 1000

int n=N;

double vdx[N], vdy[N];

vexp (vdy, vdx, &n); // vdy[i] = exp (vdx[i]), i=0,...,n-1

float vfx[N], vfy[N];

vsexp (vfy, vfx, &n); // vfy[i] = exp (vfx[i]), i=0,...,n-1

double vdx[N], vdy[N], vdz[N];

vpow (vdz, vdx, vdy, &n); // vdz[i] = pow (vdx[i], vdy[i]), i=0,...,n-1

Page 33: IBM XL Compilers Performance Tuning 2016-11-18


MASS Vector Library – Calling from Fortran

include 'massv.include' ! interfaces for vector MASS functions

integer, parameter :: n=1000

real*8 vdx(n), vdy(n)

call vexp (vdy, vdx, n) ! vdy(i) = exp (vdx(i)), i=1,...,n

real*4 vfx(n), vfy(n)

call vsexp (vfy, vfx, n) ! vfy(i) = exp (vfx(i)), i=1,...,n

real*8 vdx(n), vdy(n), vdz(n)

call vpow (vdz, vdx, vdy, n) ! vdz(i) = vdx(i)**vdy(i), i=1,...,n

Page 34: IBM XL Compilers Performance Tuning 2016-11-18

MASS SIMD Library

§ Computes the same math function for each element of a SIMD vector
  • Convenient when writing code with vector datatypes and built-in functions, e.g. vector double, vector float, vec_add(), etc.
  • Vector MASS is recommended for best performance if the vector length is non-trivial
§ Calling from C:

#include <mass_simd.h> // prototypes for vector SIMD functions

vector double vdx, vdy;

vdy = expd2 (vdx); // vdy[i] = exp (vdx[i]), i=0,1

vector float vfx, vfy;

vfy = expf4 (vfx); // vfy[i] = exp (vfx[i]), i=0,1,2,3

vector double vdx, vdy, vdz;

vdz = powd2 (vdx, vdy); // vdz[i] = pow (vdx[i], vdy[i]), i=0,1

Page 35: IBM XL Compilers Performance Tuning 2016-11-18

MASS SIMD Library – Calling from Fortran

include 'mass_simd.include' ! interfaces for SIMD MASS functions

vector(real*8) vdx, vdy

vdy = expd2 (vdx) ! vdy(i) = exp (vdx(i)), i=1,2

vector(real*4) vfx, vfy

vfy = expf4 (vfx) ! vfy(i) = exp (vfx(i)), i=1,2,3,4

vector(real*8) vdx, vdy, vdz

vdz = powd2 (vdx, vdy) ! vdz(i) = vdx(i)**vdy(i), i=1,2

Page 36: IBM XL Compilers Performance Tuning 2016-11-18

Linking MASS (for manual use)

§ Scalar MASS
  • -l mass (common for all supported POWER processors)
§ Vector MASS
  • -l massv, generic for all supported POWER processors
  • -l massvp8 for POWER8 (available for LE or BE)
§ SIMD MASS
  • -l mass_simdp8 for POWER8 (available for LE or BE)
§ Example of linking all MASS libraries when compiling for POWER8
  • xlc main.c -qarch=pwr8 -qaltivec -q64 -l mass -l massvp8 -l mass_simdp8

Page 37: IBM XL Compilers Performance Tuning 2016-11-18

MASS Library Exploitation with XL Compilers

§ The XL C/C++ and XL Fortran compilers are able to
  • Recognize opportunities in source code to use MASS
  • Auto-vectorize: generate calls to MASS vector functions
  • Auto-inline: inline MASS scalar functions
  • Auto-scalarize: generate calls to MASS scalar functions
§ Compiler optimization levels of -O3 -qhot or above enable automatic MASS exploitation
§ The transformation report shows automatic MASS usage, e.g. the loop

  for (i=0; i<n; i++) {
    b[i] = sqrt(a[i]);
  }

  becomes a single call __vsqrt_P8(b, a, n), and the report states "Loop vectorization was performed."

Page 38: IBM XL Compilers Performance Tuning 2016-11-18

Parallelization

§ User-driven parallelism
  • All optimization levels interoperate with the POSIX threads implementation
  • The full OpenMP 3.1 implementation provides a simple mechanism for writing parallel applications
    – Based on pragmas/annotations on top of sequential code
    – Industry specification, developed by the OpenMP consortium (www.openmp.org)
§ Compiler-driven parallelism
  • A mechanism for the compiler to automatically identify and exploit data parallelism
  • Identifies parallelizable loops performing independent operations on arrays or vectors
    – Best results on loop-intensive, compute-intensive workloads
    – Aided by program annotations; fully interoperable with OpenMP

Page 39: IBM XL Compilers Performance Tuning 2016-11-18

Inter-Procedural Analysis (IPA)

§ Optimizes the whole program at module scope
  • Intercepts the linker and re-optimizes the program at module scope
§ Three levels of aggressiveness (-qipa=level=0/1/2)
  • Balance between aggressive optimization and longer optimization time
§ Enables additional program optimizations
  • Cross-file inlining (including cross-language)
  • Global code placement based on call affinity
  • Global data reorganization
§ Reduction in TOC pressure through data coalescing

Page 40: IBM XL Compilers Performance Tuning 2016-11-18

Profile-Directed Optimization (PDF)

§ Collects program statistics from a training run for use in a subsequent optimization phase
  • Minor impact on execution time of the instrumented program (10% - 50%)
  • Static program information: call frequencies, basic-block execution counts
  • Value profiling: collects a histogram of values for expressions of interest
  • Hardware counter information (optional)
§ Supports multiple training runs and parallel instances of the program
  • Profiling information from multiple training runs is aggregated into a single file
  • Locking is used to avoid clobbering the profiling data on file
§ Integrated with the IPA process (implies -qipa=level=0)
  • PDF synchronization point at the beginning of the link-time optimization phase
  • No need to recompile source files for PDF2; only relink with the -qpdf2 option
§ Tolerates program changes between instrumentation and optimization
  • The compiler skips profile-based optimization for any modified functions
  • Shows an estimate of the relevance of the profiling data

Page 41: IBM XL Compilers Performance Tuning 2016-11-18


Performance Tuning Tips with XL Compilers and Libraries

Page 42: IBM XL Compilers Performance Tuning 2016-11-18

Frequently Used XL Compiler Options

§ Typically start from -O2 or -O3
§ Add high-order optimization (-qhot) for floating-point computation-intensive and memory-intensive workloads, e.g. those that spend a lot of time in loops
§ Add whole-program optimization (-qipa[=level=0|1|2]) for workloads with a lot of small C/C++ function calls
§ Add profile-directed feedback optimization (-qpdf1/-qpdf2) for workloads with lots of branching and function calls. Usage:
  1. Instrumentation: export PDFDIR=your_work_dir, then compile the program with -qpdf1=exename to generate an instrumented executable
  2. Profile: use typical input data to run the executable and generate the profile data in PDFDIR
  3. Recompile: recompile the program with -qpdf2=exename to generate the optimized executable
§ Add -qsmp=omp for OpenMP workloads (-qoffload for GPU exploitation)
§ Add the CUDA option for CUDA workloads

Page 43: IBM XL Compilers Performance Tuning 2016-11-18

Performance Tuning using Compiler Transformation Reports

§ Generates compilation reports consumable by other tools
  • Enables better visualization and analysis of compiler information
  • Helps users do manual performance tuning
  • Helps automatic performance tuning through performance-tool integration
§ Unified report from all compiler subcomponents and analyses
  • Compiler options
  • Pseudo-sources
  • Compiler transformations, including missed opportunities
§ Consistent support among Fortran and C/C++
§ Controlled by the -qlistfmt option:
  -qlistfmt=[xml|html]=inlines generates inlining information
  -qlistfmt=[xml|html]=transform generates loop transformation information
  -qlistfmt=[xml|html]=data generates data reorganization information
  -qlistfmt=[xml|html]=pdf generates dynamic profiling information
  -qlistfmt=[xml|html]=all turns on all optimization content
  -qlistfmt=[xml|html]=none turns off all optimization content

Page 44: IBM XL Compilers Performance Tuning 2016-11-18

Performance Tuning with Compiler Reports

Original source (file.c):

  foo (float *p, float *q, float *r, int n) {
    for (int i=0; i< n; i++) {
      p[i] = p[i] + q[i]*r[i];
    }
  }

Compiling with -qlistfmt=xml=all, file.xml reports:
  "Loop was not SIMD vectorized because a data dependence prevents SIMD vectorization"

Modified source (file.c):

  foo (float * restrict p, float * restrict q, float * restrict r, int n) {
    for (int i=0; i< n; i++) {
      p[i] = p[i] + q[i]*r[i];
    }
  }

After this tuning, file.xml reports:
  "Loop was SIMD vectorized"

Page 45: IBM XL Compilers Performance Tuning 2016-11-18

SIMDization Tuning

Transformation report: "memory accesses have non-vectorizable alignment"
User actions:
  § Use __attribute__((aligned(n))) to set data alignment
  § Use __alignx(16, a) to indicate the data alignment to the compiler
  § Use array references instead of pointers where possible

Transformation report: "data dependence prevents SIMD vectorization"
User actions:
  § Use fewer pointers when possible
  § Use #pragma independent_loop if the loop has no loop-carried dependency
  § Use the restrict keyword

Transformation report: "it is not profitable to vectorize"
User actions:
  § Use #pragma simd_level(10) to force the compiler to do SIMDization

Transformation report: "Loop was SIMD vectorized"
  § Success: no action needed

Page 46: IBM XL Compilers Performance Tuning 2016-11-18

SIMDization Tuning (continued)

Transformation report: "memory accesses have non-vectorizable strides"
User actions:
  § Loop interchange for stride-one accesses, when possible
  § Data layout reshaping for stride-one accesses
  § Higher optimization levels to propagate compile-time-known stride information
  § Stride versioning

Transformation report: "either operation or data type is not suitable for SIMD vectorization"
User actions:
  § Do statement splitting and loop splitting

Transformation report: "loop structure prevents SIMD vectorization"
User actions:
  § Convert while-loops into do-loops when possible
  § Limit the use of control flow in a loop
  § Use MIN, MAX instead of if-then-else
  § Eliminate function calls in a loop through inlining

Page 47: IBM XL Compilers Performance Tuning 2016-11-18

Compiler Friendly Code

§ The compiler must be conservative when determining potential side effects
  • Procedure calls may access or modify any visible variables
  • Accesses through pointers may modify any visible variables
§ Pessimistic side-effect analysis prevents compiler optimizations
  • Must re-compute expressions whose operands may have been modified
  • Must compute values that might otherwise be unneeded
§ Help the compiler identify side effects to improve application performance
  • Use suitable optimization levels
  • Include the appropriate header files for any system routines in use
  • Use local variables to maintain the values of global variables across function calls or pointer dereferences
  • Avoid using global variables when local variables are suitable
  • Avoid reusing local variables for unrelated purposes
  • Follow the ANSI C/C++ pointer aliasing rules
    – An object of a certain data type can only be accessed through a pointer of the same (or a compatible) data type

Page 48: IBM XL Compilers Performance Tuning 2016-11-18

Compiler Friendly Code

§ Use the restrict keyword (XLC supports multiple levels and scopes of restricted pointers) or compiler directives/pragmas to help the compiler do dependence and alias analysis
§ Use const for globals, parameters, and functions whenever possible
§ Group frequently used functions into the same file (compilation unit) to expose compiler optimization opportunities (e.g., intra-compilation-unit inlining, instruction cache utilization)
§ Limit exception handling
§ Excessive hand-optimization, such as unrolling, may impede the compiler
§ Keep array index expressions as simple as possible for easy dependence analysis
§ Consider using the highly tuned MASS/MASSV and ESSL libraries

Compiler Friendly Code


Performance Tuning Tips

§ Make use of the visibility attribute
• Load-time improvement
• Better code through PLT overhead reduction
• Code size reduction
• Symbol collision avoidance
§ Inline tuning
• Call overhead reduction
• Load-hit-store avoidance
§ Whole-program optimization by IPA
• Across-file inlining
• Code partitioning
• Data reorganization
• TOC pressure reduction


OpenMP Tuning

§ OpenMP environment variables
• OMP_NUM_THREADS: control the number of threads
• OMP_THREAD_LIMIT: control the maximum number of threads
• OMP_PLACES: control thread affinity (THREADS | CORES | SOCKETS)
• OMP_WAIT_POLICY: control the thread idle policy (ACTIVE | PASSIVE)
• OMP_STACKSIZE: control the thread stack size
• OMP_SCHEDULE: control the schedule type (DYNAMIC | GUIDED | STATIC)
• OMP_PROC_BIND: control thread binding (TRUE | FALSE, MASTER | CLOSE | SPREAD)
• OMP_DYNAMIC: control dynamic thread adjustment (TRUE | FALSE)
• OMP_DISPLAY_ENV: control whether to display environment variables (TRUE | FALSE)
§ OpenMP/Open MPI affinity
• OpenMP programs automatically detect whether they were invoked by Open MPI via environment variables that Open MPI sets.
• When Open MPI is detected, the OpenMP runtime restricts the default OMP_PLACES to the affinity that has been set for that process
– e.g., with --bind-to core, each Open MPI process is placed on a different core, and each OpenMP program is restricted to the particular core for that process
– This behavior can be overridden by manually setting OMP_PLACES, i.e., it applies only to the default setting of OMP_PLACES
– OMP_PROC_BIND will be set to TRUE


Architecture and System Specific Tuning Tips

§ System configuration
• Adjust SMT level: ppc64_cpu --smt=<level>
• Adjust hardware prefetch aggressiveness: ppc64_cpu --dscr=<value>
• Adjust CPU/memory affinity: numactl <flags>
• Set huge pages: sysctl -w vm.nr_hugepages=<number>
§ POWER8 exploitation
• POWER8-specific ISA exploitation under -qarch=pwr8
• Scheduling and instruction selection under -qtune=pwr8:SMTn (n=1, 2, 4, 8)
§ Automatic SIMDization at -O3 -qhot
• Limited use of control flow
• Limited use of pointers. Use the independent_loop directive to tell the compiler a loop has no loop-carried dependency; use either the restrict keyword or the disjoint pragma to tell the compiler that references do not share the same physical storage whenever possible
• Limited use of strided accesses. Expose stride-one accesses whenever possible
§ Data prefetch
• Automatic data prefetch at -O3 -qhot or above
• -qprefetch=dscr=N to control hardware prefetch aggressiveness


Floating-point Computation Accuracy Control

§ Aggressive optimization may affect the results of the program
• Precision of floating-point computation
• Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
• Use of alternate math libraries
§ -qstrict guarantees results identical to noopt, at the expense of optimization
• Suboptions allow fine-grained control over this guarantee
• Examples:
– -qstrict=precision: strict FP precision
– -qstrict=exceptions: strict FP exceptions
– -qstrict=ieeefp: strict IEEE FP implementation
– -qstrict=nans: strict general handling and computation of NaNs
– -qstrict=order: do not modify evaluation order
– -qstrict=vectorprecision: maintain precision over all loop iterations
§ Suboptions can be combined: -qstrict=precision:nonans


Summary

This presentation addressed:
• Frequently used XL compiler options
• How to identify program hot spots and detect performance bottlenecks with XL compilers and performance tools
• How to write compiler-friendly code for better performance
• How to do performance tuning with XL compilers and libraries
• How to do POWER8-specific optimization


Important XL Compilers Links

§ XL C/C++ home page
• http://ibm.biz/xlcpp-linux
§ C/C++/Fortran Community
• http://ibm.biz/xlcpp-linux-ce
§ XL Fortran home page
• http://ibm.biz/xlfortran-linux


Additional information

§ IBM SDK for Linux on Power
• http://ibm.biz/ibmsdklop
§ PMU events
• http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaal/iplsdkcpievents.htm
§ Code optimization with the IBM XL compilers on Power architectures
• http://www-01.ibm.com/support/docview.wss?uid=swg27005174&aid=1
§ Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
• http://www.redbooks.ibm.com/abstracts/sg248171.html
§ Implementing an IBM High-Performance Computing Solution on IBM POWER8
• http://www.redbooks.ibm.com/abstracts/sg248263.html?Open
§ NVIDIA CUDA on IBM POWER8: Technical overview, software installation, and application development
• http://www.redbooks.ibm.com/redpapers/pdfs/redp5169.pdf


This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area.

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.

IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

IBM, the IBM logo, ibm.com, AIX, AIX (logo), IBM Watson, DB2 Universal Database, POWER, PowerLinux, PowerVM, PowerVM (logo), PowerHA, Power Architecture, Power Family, POWER Hypervisor, Power Systems, Power Systems (logo), POWER2, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, POWER7, POWER7+, and POWER8 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

NVIDIA, the NVIDIA logo, and NVLink are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries or both. PowerLinux™ uses the registered trademark Linux® pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the Linux® mark on a world-wide basis.

The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org. The OpenPOWER word mark and the OpenPOWER Logo mark, and related marks, are trademarks and service marks licensed by OpenPOWER.

Other company, product and service names may be trademarks or service marks of others.

Notices and Disclaimers