
© 2013 IBM Corporation

HPC Workload Performance Tuning with IBM XL Compilers and Libraries

SPXXL/Scicomp 2013 Summer Meeting

Yaoqing Gao

IBM Toronto Lab


IBM | Software Group | Rational

Disclaimer © Copyright IBM Corporation 2013. The information contained in these materials is provided for

informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.

Please Note:

–  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

–  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

–  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

May 2013 HPC Workload Performance Tuning with IBM XL Compilers and Libraries



Goals

§ What are frequently used compiler option sets

§ What are frequently used compiler directives/pragmas

§ How to detect program hot spots and performance bottlenecks with XL compilers and performance tools

§ How to write compiler-friendly code for better performance

§ How to do performance tuning with XL compilers

§ How to do POWER7 optimization


Agenda

§ Overview of XL Compilers

§ Performance Controls

– Compiler options

– Directives and pragmas

§ Hot spot and bottleneck detection

§ Program tuning for performance

§ POWER7 optimization

§ Documentation

§ Q&A



Session 1: Overview of XL Compilers





Overview of XL Compiler Family

§ Similar compilation technology used to implement C, C++ and Fortran compilers
§ Target AIX, Linux on Power, z/OS (C/C++ only), Blue Gene, Cell
§ Language standard compliance
– C99 standard compliance
– C++98 and subsequent TRs, selected C++11 features
– Fortran 2003 standard compliance
– OpenMP implementation for parallel programming
§ Fully backward compatible with objects compiled with older compilers
– Supports mix-and-match of objects generated with different compilers and optimization levels
– Backward compatibility through option control in some rare situations:
• C++ name mangling, OpenMP TLS, etc.
§ GCC affinity
– Partial source and full binary compatibility with gcc




Overview of XL Compiler Family

§ Platform exploitation
– -qarch: ISA exploitation
– -qtune: skew performance tuning for a specific processor, including -qtune=balanced
– Large portfolio of compiler builtins and performance annotations
§ Advanced optimization capabilities
– Five distinct optimization packages
– Aggressive loop analysis and transformations (unimodular and polyhedral framework)
– Whole program optimization
– SIMD code generation and vectorization exploitation
– Parallelization (automatic and user-driven through OpenMP)
– Profile-driven optimization




IBM XL Compiler architecture: Basic compilation environment

§ Used at lower optimization levels
§ Focus on fast compilation: noopt, -O2
§ More aggressive optimization, with limited impact on compilation time: -O3 -qnohot
– Implies -qnostrict, which may affect program behavior (mainly precision of floating-point operations)
§ Optionally generate an assembly listing file

[Figure: compilation flow involving the source file, the C, C++ and Fortran front ends (FE), xl*code, the object file and source.lst]




IBM XL Compiler architecture: Advanced compilation environment

§ Focus on runtime performance, at the expense of compilation time
– Aggressive loop transformations
– More precise dataflow analysis
§ Triggered by several compiler flags: -O3, -qhot, -qsmp
§ Multiple levels of aggressiveness for loop transformations
– -qhot=level=0 (default at -O3)
– -qhot=level=1 (default at -qhot)
– -qhot=level=2
§ Can be combined with -qstrict

[Figure: compilation flow involving the source file, ipa, xl*code, the object file and source.lst]




IBM XL Compiler architecture: Whole-program analysis environment, compile phase

§ Collect high-level program representation in preparation for link-time whole program optimization
§ Triggered by -qipa
– Implied by -O4, -O5, -qpdf1/-qpdf2
– Identical behavior at all -qipa levels
§ Can be used independently of -qhot
§ Output is a composite object file
– Includes regular object file and intermediate representation
– Allows linking the object file with or without link-time optimization
– Skip generation of regular object using -qipa=noobject

[Figure: compilation flow involving the source file, ipa, xl*code, the extended object file, the object file and source.lst]





IBM XL Compiler architecture: Whole-program analysis environment, link phase

§ Intercept the system linker and re-optimize whole program
– -qipa=level=0 (default with -qpdf)
– -qipa=level=1 (default with -qipa)
– -qipa=level=2
§ Must use the compiler invocation to link the program, with -qipa
– Do not use ld directly
§ Flexible handling of extended objects
– Can be placed in archives
– Accepts combination of regular and extended object files
§ Whole program assembly listing
– Default name a.lst
§ Under -qpdf1/-qpdf2 the compiler collects and uses runtime profile information about the program

[Figure: link-time flow involving system library object files, ipa, xl*code, the final object file, the system linker, the executable, a.lst and the profile data file]



Session 2: Performance Controls with XL Compiler options, flags, directives and pragmas




Performance Compiler Options

§ Optimization levels:
– -O0 to -O5
§ High order transformations:
– -qhot
§ Interprocedural analysis:
– -qipa or -O4 or -O5
§ Profile directed feedback optimization:
– -qpdf1/-qpdf2
§ Target machine specification:
– -qarch=pwr7 -qtune=pwr7 -qcache=auto
§ Floating point options:
– -qstrict=subopt, -qfloat=subopt
§ Program behavior options:
– -qassert=subopt, etc.
§ Diagnostic options:
– -qlist, -qreport, -qlistfmt, etc.



Summary of Optimization Levels

§ noopt, -O0
– Quick local optimizations
– Keep the semantics of a program (-qstrict)
§ -O2
– Optimizations for the best combination of compile speed and runtime performance
– Keep the semantics of a program (-qstrict)
§ -O3
– Equivalent to -O3 -qhot=level=0 -qnostrict
– Focus on runtime performance at the expense of compilation time: loop transformations, dataflow analysis
– May alter the semantics of a program (-qnostrict)
§ -O3 -qhot
– Equivalent to -O3 -qhot=level=1 -qnostrict
– Perform aggressive loop transformations and dataflow analysis at the expense of compilation time
§ -O4
– Equivalent to -O3 -qhot=level=1 -qipa=level=1 -qnostrict
– Aggressive optimization: whole program optimization; aggressive dataflow analysis and loop transformations
§ -O5
– Equivalent to -O3 -qhot=level=1 -qipa=level=2 -qnostrict
– More aggressive optimization: more aggressive whole program optimization, more precise dataflow analysis and loop transformations
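The note that -O3 and above "may alter the semantics of a program" is largely about floating-point arithmetic, where addition is not associative. The sketch below is plain C rather than anything XL-specific; the function names are illustrative, and it only shows why a reordering that an optimizer might choose under relaxed strictness can change a result.

```c
#include <stdio.h>

/* Sum evaluated left to right, exactly as written in the source. */
double sum_as_written(double a, double b, double c) {
    return (a + b) + c;
}

/* The same sum after a reassociation that an optimizer may apply
   when strict floating-point semantics are relaxed. */
double sum_reassociated(double a, double b, double c) {
    return a + (b + c);
}

/* Prints both results for one ill-conditioned input. */
void show_reassociation(void) {
    double big = 1.0e30;
    /* (big - big) + 1 keeps the 1; big + (-big + 1) loses it,
       because 1 is far below the rounding granularity of 1e30. */
    printf("as written:   %g\n", sum_as_written(big, -big, 1.0));
    printf("reassociated: %g\n", sum_reassociated(big, -big, 1.0));
}
```

If a code depends on the first behavior, combining the higher optimization levels with -qstrict, as the slides suggest, is the conservative choice.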





HPC Performance Tuning with XL Compilers: frequently used option set

-O3 -qarch=pwr7 or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict

Profiling for hot spot detection:
§ Compiler instrumentation: -qpdf1=level={1,2} / -qpdf2
§ -pg for gprof/xprofiler; -qlist for tprof
§ User-provided profile functions: -qfunctrace

SIMDization:
§ Automatic SIMDization: -O3 or above with -qsimd
§ User explicit SIMD program: -qaltivec

Loop transformations:
§ Loop transformations: -O3 or above

Whole program optimizations:
§ -O4 or -O5 for inter-procedural optimization: inlining, code partition, data reorganization

Parallelization:
§ User explicit parallelization only: -qsmp=omp
§ Auto parallelization: -qsmp (-qsmp=auto)

XML Transformation Reports:
§ -qlistfmt=xml=all




Pragmas (C/C++)

§  #pragma align

§  #pragma alloca (C only)

§  #pragma block_loop

§  #pragma chars

§  #pragma comment

§  #pragma define, #pragma instantiate (C++ only)

§  #pragma disjoint

§  #pragma do_not_instantiate (C++ only)

§  #pragma enum

§  #pragma execution_frequency

§  #pragma expected_value

§  #pragma fini (C only)

§  #pragma hashome (C++ only)

§  #pragma ibm snapshot

§  #pragma implementation (C++ only)

§  #pragma info

§  #pragma init (C only)

§  #pragma ishome (C++ only)

§  #pragma isolated_call

§  #pragma langlvl (C only)

§  #pragma leaves

§  #pragma loopid

§  #pragma map

§  #pragma mc_func

§  #pragma namemangling (C++ only)

§  #pragma namemanglingrule (C++ only)

§  #pragma nosimd

§  #pragma novector

§  #pragma object_model (C++ only)

§  #pragma operator_new (C++ only)

§  #pragma options

§  #pragma option_override

§  #pragma pack

§  #pragma pass_by_value (C++ only)

§  #pragma priority (C++ only)

§  #pragma reachable

§  #pragma reg_killed_by

§  #pragma report (C++ only)

§  #pragma simd_level

§  #pragma STDC cx_limited_range

§  #pragma stream_unroll

§  #pragma strings

§  #pragma unroll

§  #pragma unrollandfuse

§  #pragma weak

§  #pragma ibm independent_loop

Parallel processing

§  #pragma ibm critical (C only)
§  #pragma ibm independent_calls (C only)
§  #pragma ibm iterations (C only)
§  #pragma ibm parallel_loop (C only)
§  #pragma ibm permutation (C only)
§  #pragma ibm schedule (C only)
§  #pragma ibm sequential_loop (C only)
§  #pragma omp atomic
§  #pragma omp parallel
§  #pragma omp for
§  #pragma omp ordered
§  #pragma omp parallel for
§  #pragma omp section, #pragma omp sections
§  #pragma omp parallel sections
§  #pragma omp single
§  #pragma omp master
§  #pragma omp critical
§  #pragma omp barrier
§  #pragma omp flush
§  #pragma omp threadprivate
§  #pragma omp task
§  #pragma omp taskwait



Directives (Fortran)

§  ALIGN

§  ASSERT

§  BLOCK_LOOP

§  CNCALL

§  COLLAPSE

§  EJECT

§  EXECUTION_FREQUENCY

§  EXPECTED_VALUE

§  INCLUDE

§  INDEPENDENT

§  #LINE

§  LOOPID

§  MEM_DELAY

§  NOSIMD

§  NEW

§  NOVECTOR

§  PERMUTATION

§  SIMD_LEVEL

§  SNAPSHOT

§  SOURCEFORM

§  STREAM_UNROLL

§  SUBSCRIPTORDER

§  UNROLL

§  UNROLL_AND_FUSE

§  Hardware-specific directives

•  CACHE_ZERO
•  DCBF
•  DCBFL
•  DCBST
•  EIEIO
•  ISYNC
•  LIGHT_SYNC
•  PREFETCH
•  PROTECTED STREAM




Frequently Used Pragmas/Directives/Attributes

§  Dependency
–  #pragma ibm independent_loop
–  #pragma disjoint
§  Frequency
–  #pragma execution_frequency
–  #pragma expected_value
–  #pragma ibm min_iterations
–  #pragma ibm max_iterations
–  #pragma ibm iterations
§  Alignment
–  __alignx
–  __attribute__((aligned(16)))
§  SIMDization
–  #pragma nosimd
–  #pragma simd_level
§  Unroll
–  #pragma unroll
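To make the alignment, dependency and unroll annotations above concrete, here is a minimal sketch in C. The function and array names are illustrative. The XL-specific lines sit behind a guard on the `__IBMC__`/`__IBMCPP__` macros so that other compilers skip them; their exact placement is an assumption that has not been verified against an XL compiler here.

```c
#define N 1024

/* 16-byte alignment requested with the GCC-style attribute form
   listed on the slide. */
static float x[N] __attribute__((aligned(16)));
static float y[N] __attribute__((aligned(16)));

/* dst[i] += alpha * src[i]. The pragmas and __alignx are only
   seen by IBM XL C/C++ (assumed placement, see lead-in). */
void scaled_add(float *dst, const float *src, float alpha, int n) {
#if defined(__IBMC__) || defined(__IBMCPP__)
#pragma disjoint(*dst, *src)     /* dst and src never overlap */
    __alignx(16, dst);           /* both pointers are 16-byte aligned */
    __alignx(16, src);
#pragma unroll(4)
#endif
    for (int i = 0; i < n; i++) {
        dst[i] += alpha * src[i];
    }
}

/* Fills the aligned arrays and runs the kernel once. */
float demo_scaled_add(void) {
    for (int i = 0; i < N; i++) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    scaled_add(y, x, 2.0f, N);
    return y[1];                 /* 1 + 2*1 */
}
```

The annotations are promises from the programmer: if `dst` and `src` do overlap, or are not aligned as stated, the optimized code may produce wrong results.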




Restricted Pointer Enhancements in XL C/C++ V12.1

Restricted parameter pointer:

void function(float * restrict a1, float * restrict a2, int n) {
    for (int i = 0; i < n; i++) {
        a1[i] = a2[i];
    }
}

Block scope restricted pointer:

float * restrict a1 = A1;
float * restrict a2 = A2;
for (int i = 0; i < n; i++) {
    a1[i] = a2[i];
}

Multiple level restricted pointer:

float * restrict * restrict * restrict aa1 = AA1;
float * restrict * restrict * restrict bb1 = BB1;
for (int k = 0; k < n3; k++) {
    for (int j = 0; j < n2; j++) {
        for (int i = 0; i < n1; i++) {
            aa1[i][j][k] = bb1[i][j][k];
        }
    }
}

•  Determine if two different pointers are being used to reference different objects

•  Refine aliasing to expose optimization opportunities
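The fragments above can be put into a complete, compilable form. The following is a minimal sketch in standard C99; the function names `add_arrays` and `scale_rows` are illustrative and do not come from the slides.

```c
/* Parameter form: the caller promises that out, in1 and in2 refer
   to non-overlapping storage for the duration of the call. */
void add_arrays(float * restrict out, const float * restrict in1,
                const float * restrict in2, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in1[i] + in2[i];
    }
}

/* Block-scope form: the promise covers only the block in which
   the restricted pointer is declared. */
void scale_rows(float **rows, int nrows, int ncols, float factor) {
    for (int r = 0; r < nrows; r++) {
        float * restrict row = rows[r];
        for (int c = 0; c < ncols; c++) {
            row[c] *= factor;
        }
    }
}
```

Without `restrict`, the compiler has to assume that a store through `out` could change what `in1` or `in2` point to, which limits reordering and SIMDization of the loop.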




IBM align and iteration count directives

module mod
  real :: x(500), y(500), z(500)
  !ibm* align(16, x, y, z)
end module

subroutine partial_sum(m, n)
  use mod
  integer, intent(in) :: m, n
  !ibm* assert(itercnt(40))
  !ibm* assert(itercnt(120))
  do i=m, n
    z(i) = x(i) + 1.37*y(i)
  enddo
end subroutine

§ The align directive guides the compiler to align arrays x, y, and z to 16 bytes
•  Avoid cache conflicts and false sharing
•  Expose SIMDization opportunity

§ The assert(itercnt) directives state that the frequently used loop iteration counts are 40 and 120
•  Guide the compiler's profitability analysis
•  Expose loop optimization opportunities



Summary

§ Frequently used compiler option sets
– -O3 -qarch=pwr7 -qtune=pwr7
– -O3 -qhot -qarch=pwr7 -qtune=pwr7

§ Frequently used compiler directives/pragmas
– Dependency and alias analysis
– Alignment
– Frequency
– Program behavior
– Transformations




Session 3: Program hot spot and performance bottleneck detection with XL Compilers and performance tools





Profiling with XL Compilers and gprof/xprof

§ Compiler option -pg for gprof/xprof
– gmon.out will be generated at run time
§ gprof output (gprof cba gmon.out > gprof.out)

  %    cumulative    self                  self      total
 time    seconds    seconds      calls    us/call   us/call   name
92.84     18.18      18.18      213340     85.21     85.21    .bb_it
 4.81     19.12       0.94   410042542      0.00      0.00    .__gmon_start__
 1.69     19.45       0.33                                    .main
 0.61     19.57       0.12      106669      1.13    171.54    ._bit_array
 0.00     19.57       0.00      213336      0.00      0.01    .test_it
 0.00     19.57       0.00           2      0.00      0.00    ._bit_array_mn



Profiling using functrace feature of XL Compilers

User-defined tracing functions:

extern "C" void __func_trace_enter(
    const char *function_name,
    const char *file_name,
    int line_number,
    void** const user_data) {…}

extern "C" void __func_trace_exit(…, int line_number, …) {…}

extern "C" void __func_trace_catch(…, int line_number, …) {…}

User functions, with calls to the tracing functions inserted by the compiler under -qfunctrace:

void my_function(…)
{   /* entry point */
    …
    /* exit point */
}

Options:

-qfunctrace
-qfunctrace+<list>      trace a list of function names, class names and name scopes
-qfunctrace-<list>      do not trace a list of function names, class names and name scopes
-qoptfile=option_file   specify the name of a file containing -qfunctrace
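As an illustration of user-defined tracing functions, the sketch below implements an entry and an exit hook in C that print an indented call trace to stderr. The entry signature follows the one shown on the slide; the exit hook is assumed to take the same parameter list, and the counters and the `trace_call_count` helper are illustrative additions.

```c
#include <stdio.h>

static int trace_depth = 0;   /* current call nesting, used for indentation */
static int trace_calls = 0;   /* number of traced function entries so far */

/* Called at the entry point of each traced function. */
void __func_trace_enter(const char *function_name, const char *file_name,
                        int line_number, void **const user_data) {
    (void)user_data;          /* unused in this sketch */
    trace_calls++;
    fprintf(stderr, "%*senter %s (%s:%d)\n", 2 * trace_depth, "",
            function_name, file_name, line_number);
    trace_depth++;
}

/* Called at each exit point (parameter list assumed to match entry). */
void __func_trace_exit(const char *function_name, const char *file_name,
                       int line_number, void **const user_data) {
    (void)user_data;
    if (trace_depth > 0) {
        trace_depth--;
    }
    fprintf(stderr, "%*sexit  %s (%s:%d)\n", 2 * trace_depth, "",
            function_name, file_name, line_number);
}

/* Illustrative helper: how many function entries were traced. */
int trace_call_count(void) {
    return trace_calls;
}
```

It is probably safest to keep these definitions in a source file that is itself built without -qfunctrace, so that the hooks are not instrumented in turn.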





Profile Directed Feedback Optimization with XL Compilers

§  Step 1: compiler instrumentation with the option -qpdf1 to generate an instrumented executable

§  Step 2: profile information gathering with the representative input data

§  Step 3: profile directed feedback optimization by re-compiling with -qpdf2 to generate an optimized executable
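The three steps can be sketched on a small kernel whose branch behavior depends on the input data. The function name, file names and sample values below are hypothetical; the build commands in the comment only combine the -qpdf1 and -qpdf2 options named above with an ordinary xlc invocation.

```c
/* Assumed workflow (file and program names are hypothetical):
 *   xlc -O3 -qpdf1 outliers.c -o outliers    step 1: instrumented build
 *   ./outliers < typical_input               step 2: training run
 *   xlc -O3 -qpdf2 outliers.c -o outliers    step 3: optimized rebuild
 */

/* Counts samples above a threshold. If the training input makes the
   branch rarely taken, the profile records that, and the -qpdf2 build
   can arrange the code around the common path. */
long count_outliers(const int *samples, long n, int threshold) {
    long outliers = 0;
    for (long i = 0; i < n; i++) {
        if (samples[i] > threshold) {
            outliers++;
        }
    }
    return outliers;
}
```

The benefit depends on how representative the training input is: a profile gathered on atypical data can steer the optimizer toward the wrong path.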

Call Counter Information

Region # 1
Region Execution Count: 1
Call Coverage: 32 / 49

Call Counters
Call Name      Call Execution Count   Line #
smtctl_        0                      79
is_smt_on_     1                      55
jbind_         1                      109
jbind_         0                      108
aff_           1                      38

Basic Block Counter Information

Region # 1
Region Execution Count: 5
Block Coverage: 81 / 81

Block Counters
Block Index    Block Execution Count   Start Line #   End Line #
3              5                       1              33
4              5                       33             34
……




Multiple-pass Dynamic Profiling Infrastructure

[Figure: the source code is compiled and linked with -qpdf1 (static analysis, profile based refinement) to produce the instrumented application; running it on sample inputs produces profile data; compiling and linking with -qpdf2 (profile directed optimizations) produces the optimized application.]

§ Hardware and software constraints

§ Multiple sample runs for different hardware performance events

§ Profile based instrumentation refinement



Profiling with other performance tools

§ HPC/HPCS toolkit
– e.g., hpccount -g <group[,<group>,...]> <program>

§ Oprofile on Linux and tprof on AIX

Subroutine      Ticks    %       Source       Address   Bytes
==========      =====    ======  ======       =======   =====
.match          522      1.80    scanner.c    34b0      2100
.train_match    61       0.21    scanner.c    5950      30e0
.simtest2       12       0.04    scanner.c    55b0      3a0

4 566| 0004B0 stfd D8C40010 1 STFL (*)aggr#2.W.rns88.(gr4,16)=fp6

- 567| 0004B4 fmadd FC0601BA 1 FMA fp0=fp0,fp6,fp6,fcr

- 566| 0004B8 lwz 80670004 1 L4A gr3=(*)aggr#2.I.rns21.(gr7,4)

- 566| 0004BC addi 38840040 1 AI gr4=gr4,64

2 566| 0004C0 fmadd FD1D413A 1 FMA fp8=fp8,fp29,fp4,fcr

- 566| 0004C4 lwz 80A70014 1 L4A gr5=(*)aggr#2.I.rns21.(gr7,20)

- 566| 0004C8 lfd C8860000 1 LFL fp4=(*)double.rns81.(gr6,0)

- 566| 0004CC lwz 80C70008 1 L4A gr6=(*)aggr#2.I.rns21.(gr7,8)

23 566| 0004D0 lfd CBC90008 1 LFL fp30=(*)aggr#2.U.rns89.(gr9,8)

Annotated assembly above: the first column gives the tick count (CPU distribution), followed by the source line number and the instruction.




Reading the listing file

79| 0054FC subf 7C641850 1 S gr3=gr3,gr4
79| 005500 sradi 7C630E74 1 SRA8C gr3=gr3,1
79| 005504 addze 7C630194 1
79| 005508 std F86102A8 1 ST8 @__17212.nat.3(gr1,680)=gr3
79| 00550C std F86102B0 1 ST8 __17212(gr1,688)=gr3
79| CL.592:
81| 005558 addi 390000DE 1 LI gr8=222
81| 00555C bl 4BFFAAA5 1 CALL gr3=ab_resize,6,@PALI_SHADOW_CONST.rns909.,gr3-gr8,ab_resize",gr1,cr[01567]",gr0",gr4"-gr12",fp0"-fp13"
81| 005560 ori 60000000 1
81| 005564 std F86102D8 1 ST8 p(gr1,728)=gr3
53| 005568 std F86102E0 1 ST8 __17475(gr1,736)=gr3
53| 00556C b 4800000C 1 B CL.594,-1

Columns, left to right: source line number, code offset, machine opcode, actual binary, cycle count (meaningless, a holdover from POWER1), XIL.



Tips for Hot Spot and Bottleneck Detection

§ Identify hot spots by gathering profile information such as timing, call frequency, block frequency, frequently used values, etc., using
– gprof/oprofile
– Combination of user-provided trace functions and compiler instrumentation with -qfunctrace
– Compiler instrumentation with -qpdf1/-qpdf2 for profile directed feedback (PDF)
– HPC toolkit

§ Identify whether a workload is memory latency/bandwidth intensive, computation intensive, or a combination of both, by gathering performance counter information about
– CPI breakdown
– FPU/FXU
– Cache misses
– Branch mispredictions
– LSU
– …




Session 4: Performance tuning with XL compilers





XML Compiler Transformation Reports

§ Generate compilation reports consumable by other tools
– Enable better visualization and analysis of compiler information
– Help users do manual performance tuning
– Help automatic performance tuning through performance tool integration

§ Unified report from all compiler subcomponents and analysis
– Compiler options
– Pseudo-sources
– Compiler transformations, including missed opportunities

§ Consistent support among Fortran, C/C++

§ Controlled under option -qlistfmt:

-qlistfmt=[xml | html]=inlines      generates inlining information
-qlistfmt=[xml | html]=transform    generates loop transformation information
-qlistfmt=[xml | html]=data         generates data reorganization information
-qlistfmt=[xml | html]=pdf          generates dynamic profiling information
-qlistfmt=[xml | html]=all          turns on all optimization content
-qlistfmt=[xml | html]=none         turns off all optimization content




Loop information: Loop Table

Loop    Start    End      Parent Loop   Nest    Minimum   Maximum   Iteration
Index   Line #   Line #   Index         Level   Cost      Cost      Count         Attributes
1       203                             1       19630     19630     75 (array)    well behaved; bump normalized; lower bound normalized
2       188                             1       20413     20413     149 (array)   perfect nest; well behaved; bump normalized; guarded; lower bound normalized
3       141                             1       13300     13300     100 (default) perfect nest; well behaved; bump normalized; guarded; lower bound normalized
4       203                             1       19630     19630     5 (PDF)       residual; well behaved; bump normalized; guarded; lower bound normalized

The start line gives the source location of each loop. The loop iteration count is based on static analysis or dynamic profiling.




Loop Transformation Reports

Seq #   Type                          Phase                  Region #   Line #   Loop Index   Description                         Attributes
2       LoopVector (success)          High Level Optimizer   4          217      1            Loop vectorization was performed.   not available
3       LoopFusion (success)          High Level Optimizer   4          108      3            Loops were fused.                   Loop Line Number: 108; Loop Line Number: 206
4       LoopVectorVersion (success)   High Level Optimizer   4          108      3            Vector versioning was performed.    not available
20      ModuloSchedule (success)      Low Level Optimizer    12         3499     26           Loop was modulo scheduled.          Initiation Interval: 12




Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)

Original source file (file.c):

void foo(float *p, float *q, float *r, int n) {
    for (int i = 0; i < n; i++) {
        p[i] = p[i] + q[i]*r[i];
    }
}

Report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.

Modified source file after tuning (file.c):

void foo(float * restrict p, float * restrict q, float * restrict r, int n) {
    for (int i = 0; i < n; i++) {
        p[i] = p[i] + q[i]*r[i];
    }
}

Report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.




Explicit SIMD programming for POWER7

Enabled under -qaltivec

§ Successor to Altivec programming extensions on POWER6/PPC970
– Altivec data types (16-byte vectors):
    vector char        16 elements
    vector short       8 elements
    vector pixel       8 elements
    vector int         4 elements
    vector float       4 elements
– VSX Altivec extensions (16-byte vectors):
    vector double      2 elements
    vector long long   2 elements

§ Altivec built-in functions extended to new data types: vec_add(vector double, vector double), vec_sub(vector long long, vector long long)

§ New vector operations: vec_mul, vec_div, …

§ Unaligned load and store operations
– Altivec truncating loads/stores still available: vec_ld, vec_st
– New non-truncating loads/stores: vec_xld2, vec_xstd2




VSX Example

C:

#include <math.h>
#include <altivec.h>
extern double x[1000],y[1000],z[1000];
void sub(){
    int i;
    vector double x2,y2,z2;
    for(i=0;i<1000;i+=2) {
        x2=vec_xld2(0,&x[i]);
        y2=vec_xld2(0,&y[i]);
        z2=vec_sqrt(vec_add(vec_mul(x2,x2),vec_mul(y2,y2)));
        vec_xstd2(z2,0,&z[i]);
    }
}

Fortran:

subroutine sub(x,y,z)
  real*8 x(*),y(*),z(*)
  vector(real(8)) x2,y2,z2
  do i=1,1000,2
    x2=vec_xld2(0,x(i))
    y2=vec_xld2(0,y(i))
    z2=vec_sqrt(vec_add(vec_mul(x2,x2),vec_mul(y2,y2)))
    call vec_xstd2(z2,0,z(i))
  enddo
  return
end

Commands:

xlc -O3 -qarch=pwr7 -qvecnvol -qaltivec py.c
xlf90 -O3 -qarch=pwr7 -qvecnvol py.f

Compiler Listing:

4| CL.5:
0| 000050 addi 38840010 1 AI gr4=gr4,16
7| 000054 xvsqrtdp F060032C 1 VDFSQRT vs3=vs0,fcr
7| 000058 xvcpsgnd F0021780 1 LRVS vs0=vs2
7| 00005C xvmuldp F0442380 1 VDFM vs2=vs4,vs4,fcr
6| 000060 lxvd2x 7C840698 1 VLQD vs4=y[]@gr612->.y(gr4,gr0,0)
7| 000064 xvmaddad F0010B08 1 VDFMA vs0=vs0,vs1,vs1,fcr
5| 000068 lxvd2x 7C250698 1 VLQD vs1=x[]@gr609->.x(gr5,gr0,0)
0| 00006C addi 38A50010 1 AI gr5=gr5,16
8| 000070 stxvd2x 7C630798 1 VSTQD z[]@gr621->.z(gr3,gr0,0)=vs3
0| 000074 addi 38630010 1 AI gr3=gr3,16
0| 000078 bc 4320FFD8 1 BCT ctr=CL.5,,100,0




Automatic SIMDization

§ Automatic SIMDization for VMX and VSX
– Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX

§ Features:
– Basic block level SIMDization
– Loop level aggregation
– Data conversion, reduction
– Loop with limited control flow
– Automatic SIMDization with -qstrict (VSX) and -qnostrict
– Support of unaligned vector memory accesses (VSX)
– Automatic SIMDization enabled at -O3 -qsimd




AutoSIMD: VSX example

Compiler Listing:

0| CL.115:

9| VLQD [email protected][].rns2.0(gr11,gr12,0)




9| VDFM vs4=vs4,vs4,fcr

9| VDFM vs5=vs5,vs5,fcr

9| LRVS vs8=vs2

9| VDFMA vs8=vs8,vs9,vs9,fcr

9| LRVS vs9=vs3

9| VDFMA vs9=vs9,vs10,vs10,fcr

9| VDFSQRT vs10=vs0,fcr

9| VDFSQRT vs11=vs1,fcr

9| VSTQD @V.z[].rns1.2(gr9,gr26,0)=vs7

9| VSTQD @V.z[].rns1.2(gr9,gr25,0)=vs6

0| AI gr9=gr9,64

...

0| BCT ctr=CL.115,,100,0

Fortran source:

subroutine sub(x,y,z)
  integer i
  real*8 x(*),y(*),z(*)
  CALL ALIGNX(16, x(1))
  CALL ALIGNX(16, y(1))
  CALL ALIGNX(16, z(1))
  do i=1,1000
    z(i)=sqrt(x(i)*x(i)+y(i)*y(i))
  enddo
  return
end

Command: xlf90 -O3 -qhot -qstrict -qarch=pwr7 -qsimd -qvecnvol py.f




SIMDization Tuning

Each transformation report message is listed with the corresponding user actions.

Transformation report: memory accesses have non-vectorizable alignment.
§ Use __attribute__((aligned(n))) to set data alignment
§ Use __alignx(16, a) to indicate the data alignment to the compiler
§ Use -qassert=refalign if all references are naturally aligned
§ Use array references instead of pointers where possible

Transformation report: data dependence prevents SIMD vectorization.
§ Use fewer pointers when possible
§ Use #pragma ibm independent_loop if the loop has no loop-carried dependency
§ Use #pragma disjoint (*a, *b) if a and b are disjoint
§ Use the restrict keyword or the compiler option -qrestrict

Transformation report: it is not profitable to vectorize.
§ Use #pragma simd_level(10) to force the compiler to do SIMDization

Transformation report on success: loop was SIMD vectorized.




SIMDization Tuning

Transformation report: memory accesses have non-vectorizable strides.
User actions:
§ Loop interchange for stride-one accesses, when possible
§ Data layout reshape for stride-one accesses
§ Higher optimization to propagate stride information known at compile time
§ Stride versioning

Transformation report: either operation or data type is not suitable for SIMD vectorization.
User actions:
§ Do statement splitting and loop splitting

Transformation report: loop structure prevents SIMD vectorization.
User actions:
§ Convert while-loops into do-loops when possible
§ Limit the use of control flow in a loop
§ Use MIN, MAX instead of if-then-else
§ Eliminate function calls in a loop through inlining
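Two of the user actions above, loop interchange for stride-one accesses and replacing if-then-else with a MIN/MAX-style select, can be sketched in portable C. The array sizes and function names are illustrative; whether a given compiler actually vectorizes either form still has to be checked in its transformation report.

```c
#define ROWS 4
#define COLS 8

/* Column-first traversal: in row-major C the inner loop touches
   a[i][j] with a stride of COLS elements. */
void scale_strided(double a[ROWS][COLS], double s) {
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            a[i][j] *= s;
        }
    }
}

/* After loop interchange: the inner loop is stride-one. */
void scale_unit_stride(double a[ROWS][COLS], double s) {
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            a[i][j] *= s;
        }
    }
}

/* Lower clamp written as a select, i.e. MAX(v[i], lo), instead of
   an if statement that stores in only one arm. */
void clamp_min(double *v, int n, double lo) {
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] < lo) ? lo : v[i];
    }
}
```

Both versions of the scaling loop compute the same values; only the memory access order differs.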




MASS and MASSV Libraries

§ Libraries of mathematical routines tuned for optimal performance on various POWER architectures
– General implementation tuned for POWER
– Specific implementations tuned for specific POWER processors (pwr5, pwr6, pwr7)

§ Compiler will automatically insert calls to MASS/MASSV routines at higher optimization levels
– Users can add explicit calls to the library

Source loop:

for (i=0;i<n;i++) {
    b[i]=sqrt(a[i]);
}

Call inserted by the compiler:

__vsqrt_P7(b,a,n);

Transformation report: Loop vectorization was performed.




MASS enhancements and Auto-vectorization on POWER7

§ MASS enhancements for POWER7
– POWER7 vector MASS library (libmassvp7.a)
•  Internally exploit VSX instructions
•  SP: average speedup of 1.99 vs Power5 MASSV
•  DP: average speedup of 1.27 vs Power5 MASSV
– POWER7 SIMD MASS library (libmass_simdp7.a)
•  Tuned math routines operating on vector data types
•  Over 35 frequently used mathematical functions
•  Both single and double precision
•  To be used in conjunction with explicit SIMD programming

§ Auto-vectorization at optimization level -O3 or above

§ -qstrict=vectorprecision to maintain precision over all loop iterations



ESSL and PESSL Libraries

§ ESSL (Engineering and Scientific Subroutine Library) is a collection of high performance mathematical subroutines providing a wide range of functions for many common scientific and engineering applications.

§ ESSL Symmetric Multi-Processing (SMP) Library provides thread-safe versions of the ESSL subroutines for use on all SMP processors. In addition, some of these subroutines support the shared memory parallel processing programming model.

§ Parallel ESSL is a scalable mathematical subroutine library for standalone clusters or clusters of servers connected via a switch and running AIX and Linux. Parallel ESSL supports the Single Program Multiple Data (SPMD) programming model using the Message Passing Interface (MPI) library.





Software-controlled data prefetching for POWER7

§ Software control over POWER7 prefetch engine, supporting up to 12 data streams

§ Fine grained software controlled data prefetch including stream type, stream length, stream stride, prefetch depth at optimization level -O3 -qhot or above

– More aggressive exploitation under option -qprefetch=aggressive

§ Global analysis for coarse grained prefetch engine control at optimization level -O5





Built-in functions for POWER7 data prefetching and cache control

§  Transient cache line touch
void __dcbtt(void *address);
void __dcbtstt(void *address);

§  Partial cache line touch
void __partial_dcbt(void *address);

§  Stride-N stream prefetch
void __protected_stream_stride(offset, stride, stream_ID);

§  Transient stream prefetch
void __transient_protected_stream_count_depth(unit_count, depth, stream_ID);
void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID);




Example of POWER7 data prefetching

Store stream prefetch for array a; transient stream prefetch for array b.

/* Store stream for a: stream direction FORWARD, stream ID 11 */
__protected_store_stream_set(FORWARD, &a, 11);

/* Stream length in 128-byte units, prefetch depth DEEPER, stream ID 11 */
__protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);

/* Load stream for b: stream direction FORWARD, stream ID 0 */
__protected_stream_set(FORWARD, &b, 0);

/* Transient stream: stream length, prefetch depth DEEPER, stream ID 0 */
__transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);

__eieio();

/* Start stream prefetch */
__protected_stream_go();

for (i=0; i< n; i++) {
    a[i] = b[i] + ...;
}




Loop Optimization

§ Traditional unimodular loop transformations for perfect regular loop nests
– Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
– Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
– Under the optimization level -O3, -O3 -qhot or above

§ Polyhedral framework for any loop nests
– Provide abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes under option control -qhot=level=2 with -qsmp
•  Loop skewing, loop tiling for triangular loop shapes
– Perform exact dependence testing through unified dependence formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks at all -qhot levels (-O3 -qhot or above)
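Loop tiling, one of the transformations listed above, can also be written by hand, which makes it easier to see what the compiler is doing. The sketch below tiles a matrix transpose in C; the matrix size, the tile size and the function names are illustrative assumptions, and the tile size would normally be tuned to the cache.

```c
#define N 64
#define TILE 16   /* tile edge; assumed value, N must be a multiple of it */

/* Straightforward transpose: dst is written with a stride of N. */
void transpose_naive(double src[N][N], double dst[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            dst[j][i] = src[i][j];
        }
    }
}

/* Tiled transpose: work proceeds one TILE x TILE block at a time,
   so the rows of src and dst touched by a block stay in cache. */
void transpose_tiled(double src[N][N], double dst[N][N]) {
    for (int ii = 0; ii < N; ii += TILE) {
        for (int jj = 0; jj < N; jj += TILE) {
            for (int i = ii; i < ii + TILE; i++) {
                for (int j = jj; j < jj + TILE; j++) {
                    dst[j][i] = src[i][j];
                }
            }
        }
    }
}
```

With the XL compilers the same effect would normally be requested through the optimization level or the block_loop pragma rather than coded manually.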





Polyhedral Loop Transformation Examples

§ Dependence analysis

do i = 1,N
  do j = 1,i
    do k = i+1,N
      c(k) = c(j) + b(j,k)

§ Enable interchange of j and k loops to improve access locality for b

– Identifies independence of memory accesses to c

§ Affects all optimization levels that include -qhot

§ Loop transformations

§ Tiling of triangular matrix multiplication

do i = 1,N
  do j = i+1,N
    do k = i+1,N
      c(j,i) += a(k,i)*b(j,k)

§ Also allows transformation of imperfect loop nests

– Intervening code between loops

§ Only available at -qhot=level=2
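The dependence analysis example can be checked by hand. In the first loop nest the reads c(j) use j <= i while the writes c(k) use k > i, so within one i iteration they never touch the same element and the j and k loops can be swapped. The C sketch below is a zero-based translation under one stated assumption: the Fortran array b(j,k) is stored here as `bt[k][j]`, so that j remains the contiguous index as it is in column-major Fortran. Function names are illustrative.

```c
#define N 32

/* Original order: k is innermost, so bt[k][j] is read with stride N. */
void update_original(double c[N], double bt[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j <= i; j++) {
            for (int k = i + 1; k < N; k++) {
                c[k] = c[j] + bt[k][j];
            }
        }
    }
}

/* j and k interchanged: j is innermost, so bt[k][j] is stride-one.
   Legal because reads (index <= i) and writes (index > i) are disjoint
   within each i iteration. */
void update_interchanged(double c[N], double bt[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int k = i + 1; k < N; k++) {
            for (int j = 0; j <= i; j++) {
                c[k] = c[j] + bt[k][j];
            }
        }
    }
}
```

In both orders the last value stored to c[k] in iteration i comes from j = i, which is why the two versions produce identical results.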



Data Reorganization

§ Data reorganization transformations to reshape data layout to reduce memory latency, enhance cache utilization and memory bandwidth.

§ Data reorganization transformations enabled at -O5
– Data splitting
– Data interleaving
– Data transposing
– Data merging
– Data grouping
– Data compressing
– Data padding

§  -qlistfmt=xml=data (-qlistfmt=html=data) generates data reorganization transformation information in xml (html)
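Data splitting is the easiest of these transformations to picture: an array of structures is turned into one array per field, so that a loop using a single field reads contiguous memory. The C sketch below does the split by hand; the structure, the field names and the element count are illustrative and are not taken from any compiler output.

```c
#define COUNT 1000

/* Original layout: array of structures. A loop over mass alone
   also drags x, y and z through the cache. */
struct particle {
    double x, y, z, mass;
};

/* Layout after splitting: one array per field. */
struct particles_split {
    double x[COUNT], y[COUNT], z[COUNT], mass[COUNT];
};

/* Copies n particles from the original layout into the split layout. */
void split_particles(const struct particle *in,
                     struct particles_split *out, int n) {
    for (int i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->mass[i] = in[i].mass;
    }
}

/* Strided access: one useful double out of every four loaded. */
double total_mass_aos(const struct particle *p, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += p[i].mass;
    }
    return sum;
}

/* Stride-one access over the split mass array. */
double total_mass_split(const struct particles_split *p, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += p->mass[i];
    }
    return sum;
}
```

At -O5 the compiler may apply such a split automatically when whole-program analysis shows it is safe; the -qlistfmt data report indicates whether it did.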

Data Reorganization Report

Columns: Seq #, Type, Phase, Data Name, Category, Region #, Line #, Description

§ 1, ArraySplitting, High Level Optimizer, iplus, 9: An array of a large aggregated data-type was split into multiple arrays of smaller data-types.

§ 27, ArrayCoalescing, High Level Optimizer, net: Global variables were aggregated.




[Figure: data reorganization examples for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0..F3 is split into separate arrays F0[i], F1[i], F2[i], F3[i]. Array merging: rows reached through A[0]..A[3] are merged into one contiguous array A[i][j]. Array transposing: A[i][j] is rearranged into the transposed array A'.]
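Array splitting, as named on this slide, can be pictured with a hand-written C equivalent. This is only a sketch of the layout change, assuming a hypothetical record type with four double fields; the type names, function names and COUNT are illustrative, and the compiler performs its own version of this transformation internally at -O5.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 1024   /* hypothetical element count */

/* Array-of-structures layout: the four fields of one element are adjacent. */
struct record { double f0, f1, f2, f3; };

/* Split layout: each field lives in its own contiguous array. */
struct record_split { double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT]; };

/* Copies every field of s[0..n-1] into the matching split array. */
void split_fields(const struct record *s, struct record_split *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Reads f0 only: every access skips over f1..f3 (stride of 4 doubles). */
double sum_f0_aos(const struct record *s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i].f0;
    return sum;
}

/* Reads f0 only: stride-one accesses, so each cache line is fully used. */
double sum_f0_split(const struct record_split *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p->f0[i];
    return sum;
}

/* Returns 1 when both layouts produce the same sum of f0. */
int split_self_check(void)
{
    static struct record s[COUNT];
    static struct record_split p;

    for (size_t i = 0; i < COUNT; i++) {
        s[i].f0 = (double)i;
        s[i].f1 = 1.0;
        s[i].f2 = 2.0;
        s[i].f3 = 3.0;
    }
    split_fields(s, &p, COUNT);
    return sum_f0_aos(s, COUNT) == sum_f0_split(&p, COUNT);
}
```

A loop that touches a single field benefits most from the split layout; a loop that touches all four fields of each element may prefer the original layout.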




User Explicit Parallelization with OpenMP

§ Full OpenMP 3.0 implementation for C, C++ and Fortran

– Full OpenMP task parallelization
– Privatization of Fortran descriptor-based arrays

§ Efficient threadprivate implementation using OS-supported thread-local storage (TLS) by default

– Bypasses the expensive pthread key mechanism
– pthread-based implementation available under option control -qsmp=noostls for backward compatibility

§ Improved interaction between OpenMP and automatic SIMD
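The threadprivate directive mentioned above can be illustrated with a small C sketch. It uses only standard OpenMP pragmas; the variable and function names are hypothetical, and nothing here depends on how the XL runtime implements thread-local storage. When the file is compiled without OpenMP support the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

/* Per-thread accumulator. With OpenMP enabled, threadprivate gives every
   thread its own copy; without OpenMP there is a single copy. */
static double partial;
#pragma omp threadprivate(partial)

/* Dot product: each thread accumulates into its own copy of partial, then
   adds it to the shared result inside a critical section. */
double dot(const double *x, const double *y, int n)
{
    double result = 0.0;

    #pragma omp parallel
    {
        partial = 0.0;

        #pragma omp for
        for (int i = 0; i < n; i++)
            partial += x[i] * y[i];

        #pragma omp critical
        result += partial;
    }
    return result;
}

/* Self-check with integer-valued inputs, so the sum is exact in any order. */
int dot_self_check(void)
{
    double x[100], y[100];

    for (int i = 0; i < 100; i++) {
        x[i] = (double)i;
        y[i] = 2.0;
    }
    return dot(x, y, 100) == 9900.0;
}
```

A reduction(+:result) clause would be the more usual way to write this loop; threadprivate is used here only because it is the feature the slide discusses.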





Automatic parallelization

§ Improvements to automatic parallelization

– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays
– Runtime dependence testing

§ SMP runtime improvements

– Leverages TLS in the SMP runtime implementation

§ Enable automatic parallelization with the compiler option -qsmp
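Array privatization can be pictured with the following C sketch, which writes by hand what an automatic parallelizer has to prove. The names, the sizes and the computation are hypothetical. The scratch array tmp is fully written before it is read in every iteration, so no value flows between iterations and each thread can safely get its own copy; the second function states that explicitly with a standard OpenMP private clause.

```c
#include <assert.h>
#include <string.h>

#define ROWS 16   /* hypothetical sizes */
#define COLS 8

/* Serial form: one scratch array is reused by every row iteration. */
void rows_shared_scratch(double in[ROWS][COLS], double out[ROWS])
{
    double tmp[COLS];

    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            tmp[c] = in[r][c] * in[r][c];     /* tmp is fully rewritten */
        double acc = 0.0;
        for (int c = 0; c < COLS; c++)
            acc += tmp[c] * tmp[COLS - 1 - c]; /* then read */
        out[r] = acc;
    }
}

/* Privatized form: private(tmp) gives each thread its own scratch array,
   which removes the only storage shared between row iterations. */
void rows_private_scratch(double in[ROWS][COLS], double out[ROWS])
{
    double tmp[COLS];

    #pragma omp parallel for private(tmp)
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            tmp[c] = in[r][c] * in[r][c];
        double acc = 0.0;
        for (int c = 0; c < COLS; c++)
            acc += tmp[c] * tmp[COLS - 1 - c];
        out[r] = acc;
    }
}

/* Returns 1 when both forms produce identical output rows. */
int privatization_self_check(void)
{
    double in[ROWS][COLS], a[ROWS], b[ROWS];

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            in[r][c] = r + 0.5 * c;
    rows_shared_scratch(in, a);
    rows_private_scratch(in, b);
    return memcmp(a, b, sizeof a) == 0;
}
```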





XLSMPOPTS Environment Variable for Runtime Tuning

§ The XLSMPOPTS environment variable lets you tune the runtime behavior of OpenMP and automatically parallelized programs

§ Some suboptions of interest:

– spins and yields define the behavior of idle threads
– Thread binding using the startproc and stride suboptions
– New bind suboption on AIX: bind=SDL=<start resource>,<number of resources>,<stride>; bindlist=SDL=i0,i1,…,ix
– schedule defines the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
•  Note that the default schedule has changed from runtime to auto in V11/V13





Inlining

§ Single control knob to enable inlining

– Simplifies inlining control for the programmer: -qinline=level=X, X=0..10 (default 5)
– Consistent across all languages and optimization levels
– Previous mechanisms still available

§ User inline control with -qinline{+|-}<function_name>

§ Inlining performed by

– C++ FE
•  Small methods
•  inline and always_inline (explain when to use, impacts compile time)
•  Some exception-handling capabilities
•  Within a single source file

– High-level optimizer
•  Larger methods
•  Will inline within a source file (-O3, -qhot) and across source files (-O4, -O5, -qipa)
•  Inline across languages (-O4, -O5, -qipa)

– Low-level optimizer
•  Larger methods
•  Within a source file
•  Enabled at -O2 and higher





Control over optimizations that may affect program results: -qstrict suboptions

§ Aggressive optimization may affect the results of the program

– Precision of floating-point computation
– Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
– Use of alternate math libraries

§ -qstrict guarantees results identical to noopt, at the expense of optimization

– Suboptions allow fine-grained control over this guarantee
– Examples:

-qstrict=precision        Strict FP precision
-qstrict=exceptions       Strict FP exceptions
-qstrict=ieeefp           Strict IEEE FP implementation
-qstrict=nans             Strict generation and computation of NaNs
-qstrict=order            Do not modify evaluation order
-qstrict=vectorprecision  Maintain precision over all loop iterations

§ Can be combined: -qstrict=precision:nonans
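Why evaluation order matters for floating-point results can be shown with a small C sketch. It does not reproduce any specific XL transformation; it only demonstrates that the two mathematically equal groupings of a three-term sum can round differently in IEEE double precision. The function names are hypothetical, and the volatile temporaries are there to force each intermediate sum to be rounded to double.

```c
#include <assert.h>

/* Source order: (a + b) + c. */
double sum_source_order(double a, double b, double c)
{
    volatile double t = a + b;   /* rounded to double before the next add */
    return t + c;
}

/* Reassociated order: a + (b + c), the kind of regrouping an optimizer may
   apply when strict evaluation order is not requested. */
double sum_reassociated(double a, double b, double c)
{
    volatile double t = b + c;   /* rounded to double before the next add */
    return a + t;
}
```

With a = 1e16, b = -1e16 and c = 1.0, the source order yields 1.0 because a + b cancels exactly, while the reassociated order yields 0.0 because -1e16 + 1.0 rounds back to -1e16.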



Tips for Compiler Friendly Programming

§ Obey all language aliasing rules (avoid -qalias=noansi in C/C++)

§ Avoid unnecessary use of globals and pointers; use restrict keyword or compiler directives/pragmas to help the compiler do dependence and alias analysis

§ Use “const” for globals, parameters and functions whenever possible

§ Group frequently used functions into the same file (compilation unit) to expose compiler optimization opportunity (e.g., intra compilation unit inlining, instruction cache utilization)

§ Excessive hand-optimization such as unrolling and inlining may impede the compiler

§ Keep array index expressions as simple as possible for easy dependency analysis

§ Consider using the highly tuned MASS and ESSL libraries rather than custom implementations or generic libraries
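The tips on the restrict keyword, const and simple index expressions can be sketched together in standard C99. The function names and the test data are hypothetical. The restrict qualifiers promise the compiler that out, x and y never overlap, which removes the possible dependences that would otherwise block reordering or SIMDization of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = alpha * x[i] + y[i], with a plain stride-one index expression.
   restrict: the three arrays do not alias; const: x and y are read-only. */
void scale_add(size_t n, double *restrict out,
               const double *restrict x, const double *restrict y,
               double alpha)
{
    for (size_t i = 0; i < n; i++)
        out[i] = alpha * x[i] + y[i];
}

/* Self-check with integer-valued inputs, so every result is exact. */
int scale_add_self_check(void)
{
    double x[64], y[64], out[64];

    for (size_t i = 0; i < 64; i++) {
        x[i] = (double)i;
        y[i] = 1.0;
    }
    scale_add(64, out, x, y, 2.0);
    for (size_t i = 0; i < 64; i++)
        if (out[i] != 2.0 * (double)i + 1.0)
            return 0;
    return 1;
}
```

The promise is the caller's responsibility: passing overlapping arrays to scale_add is undefined behavior.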




Tips for POWER7 Optimizations

§ POWER7 exploitation

– POWER7-specific ISA exploitation under -qarch=pwr7
•  Extended FP register file
•  64-bit population count, bit permutation, fixed-point pipelined multiply, fixed-point select, divide check for software divide assistance
•  VMX/VSX
•  Stride-N stream prefetch, partial cache line touch

– Scheduling and instruction selection under -qtune=pwr7

§ Automatic SIMDization

– Use the simd_level(0..10) pragma to exploit aggressive SIMDization
– Use the align attribute to force the compiler to align static data on a 16-byte boundary; use MALLOCALIGN=16 to force the OS to align malloced data on a 16-byte boundary; use the alignx directive to tell the compiler the alignment
– Limit the use of control flow
– Limit the use of pointers. Use the independent_loop directive to tell the compiler a loop has no loop-carried dependency; use either the restrict keyword or the disjoint pragma to tell the compiler that references do not share the same physical storage whenever possible
– Limit the use of strided accesses. Expose stride-one accesses whenever possible
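The alignment and stride-one tips can be sketched in portable C11 as follows. The function names are hypothetical, and aligned_alloc is used here as a portable stand-in for the MALLOCALIGN=16 setting named on the slide. The __alignx call is written from memory of the XL C/C++ built-in and should be treated as an assumption; it is guarded so that other compilers never see it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates n doubles on a 16-byte boundary. aligned_alloc requires the
   size to be a multiple of the alignment, so the byte count is rounded up. */
double *alloc_aligned16(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}

/* y[i] += a * x[i]: stride-one accesses, no control flow in the loop body,
   and restrict-qualified pointers that do not alias. */
void daxpy_aligned(size_t n, double a,
                   const double *restrict x, double *restrict y)
{
#if defined(__IBMC__) || defined(__IBMCPP__)
    /* Assumed XL built-in: asserts to the compiler that the pointers are
       16-byte aligned. */
    __alignx(16, x);
    __alignx(16, y);
#endif
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Returns 1 when both buffers are 16-byte aligned and the results match. */
int aligned_self_check(void)
{
    size_t n = 100;
    double *x = alloc_aligned16(n);
    double *y = alloc_aligned16(n);

    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return 0;
    }
    int ok = ((uintptr_t)x % 16 == 0) && ((uintptr_t)y % 16 == 0);
    for (size_t i = 0; i < n; i++) {
        x[i] = (double)i;
        y[i] = 1.0;
    }
    daxpy_aligned(n, 2.0, x, y);
    for (size_t i = 0; i < n; i++)
        ok = ok && (y[i] == 2.0 * (double)i + 1.0);
    free(x);
    free(y);
    return ok;
}
```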




Tips for POWER7 Optimizations

§ POWER7-aware loop transformations

– Loop distribution, unroll-and-jam, stream unrolling controlled by the 12 streams on each core, shared by SMT threads
– Loop blocking controlled by L2 cache size
– Selection of SIMDization and vectorization controlled by the threshold; use loop iteration directives to guide the compiler

§ Memory hierarchy optimization

– Data prefetch
•  Automatic data prefetch at -O3 -qhot or above
•  -qprefetch=aggressive to enable aggressive data prefetch
•  DSCR setting for the default data prefetching
•  Enable DCBZ insertion on POWER7 IH
•  Partial cache line touch

– Data reorganization enabled at -O5





XL Compiler Documentation

§ An information center containing the documentation for XL Fortran V14.1 and XL C/C++ V12.1 is available at http://publib.boulder.ibm.com/infocenter/comphelp/v121v141/index.jsp for the AIX compilers and http://publib.boulder.ibm.com/infocenter/lnxpcomp/v121v141/index.jsp for the Linux compilers:

•  Installation Guide
•  Getting Started with XL C/C++
•  Compiler Reference
•  Language Reference

§ Whitepaper “Code optimization with the IBM XL Compilers” – http://www-01.ibm.com/support/docview.wss?uid=swg27005174

§ Whitepaper “Overview of the IBM XL C/C++ and XL Fortran Compiler Family” available at:

– http://www.ibm.com/support/docview.wss?uid=swg27005175

§ Please send any comments or suggestions on this information center or about the existing C, C++ or Fortran documentation shipped with the products to [email protected].




HPC Workload Performance Tuning with IBM XL Compilers and Libraries, May 2013
