
Compilation Technology

Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers

Priya Unnikrishnan, IBM Toronto Lab (priyau@ca.ibm.com)

CASCON 2005, October 2005

© 2005 IBM Corporation


Overview

Parallelization in IBM XL compilers

Outlining

Automatic parallelization

Cost analysis

Controlled parallelization

Future work


Parallelization

The IBM XL compilers support Fortran 77/90/95, C, and C++.

They implement both OpenMP and automatic parallelization.

Both target SMP (shared-memory parallel) machines.

Non-threadsafe code is generated by default.

– Use the _r invocations (xlf_r, xlc_r, …) to generate threadsafe code.


Parallelization options

-qsmp=noopt : Parallelizes code with minimal optimization, to allow for better debugging of OpenMP applications.

-qsmp=omp : Parallelizes code containing OpenMP directives.

-qsmp=auto : Automatically parallelizes loops.

-qsmp=noauto : No auto-parallelization; IBM and OpenMP parallel directives are still processed.
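For example, illustrative invocations combining these options with the threadsafe drivers (file names are placeholders):

    xlc_r -O5 -qsmp=auto myprog.c     # auto-parallelize loops
    xlf_r -O5 -qsmp=omp  myprog.f     # process OpenMP directives only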


Outlining

Parallelization transformation


Outlining

Original code:

    int main() {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

Runtime call:

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                      @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        endif
        return main;
    }

Outlined routine:

    Subroutine void main@OL@1(unsigned @LB, unsigned @UB) {
        @CIV1 = 0;
        do {
            a[]0[(long)@LB + @CIV1] = const;
            ......
            @CIV1 = @CIV1 + 1;
        } while ((unsigned)@CIV1 < (@UB - @LB));
        return;
    }
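In plain C terms, outlining corresponds roughly to the following source-level transformation (an illustrative sketch with made-up names, not actual compiler output):

    /* Before outlining: a simple parallelizable loop. */
    void before(int *a, int n, int c) {
        for (int i = 0; i < n; i++)
            a[i] = c;
    }

    /* After outlining: the loop body is extracted into a routine
       parameterized by the inclusive chunk bounds it should execute. */
    static void outlined_chunk(int *a, int c, unsigned lb, unsigned ub) {
        for (unsigned i = lb; i <= ub; i++)
            a[i] = c;
    }

    void after(int *a, int n, int c) {
        /* the SMP runtime invokes outlined_chunk on several threads,
           each with a different [lb, ub] range of the iteration space */
        if (n > 0)
            outlined_chunk(a, c, 0, (unsigned)(n - 1));
    }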


SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ..)

        main@OL@1(0,9)   main@OL@1(10,19)   main@OL@1(20,29)   main@OL@1(30,39)

The outlined function is parameterized: it can be invoked for different ranges of the iteration space.
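A minimal sketch of how a runtime can split the iteration space into chunks and invoke the parameterized routine (hypothetical code, run sequentially here for clarity; the real runtime dispatches each call to a different thread):

    #include <stdio.h>

    /* stand-in for the outlined routine main@OL@1 */
    static void outlined(unsigned lb, unsigned ub) {
        printf("iterations %u..%u\n", lb, ub);
    }

    int main(void) {
        unsigned n = 40, nthreads = 4;
        unsigned chunk = n / nthreads;
        for (unsigned t = 0; t < nthreads; t++) {
            unsigned lb = t * chunk;
            unsigned ub = (t == nthreads - 1) ? n - 1 : lb + chunk - 1;
            outlined(lb, ub);   /* 0..9, 10..19, 20..29, 30..39 */
        }
        return 0;
    }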


Auto-parallelization

Integrated framework for OpenMP and auto-parallelization

Auto-parallelization is restricted to loops.

Auto-parallelization is done in the link step when possible.

This allows various interprocedural analyses and optimizations to be performed before automatic parallelization.


Auto-parallelization transformation

    int main() {
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

After the dependence and cost tests pass, the loop is marked parallel:

    int main() {
        #auto-parallel-loop
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

Outlining then proceeds exactly as for OpenMP loops.


We can auto-parallelize OpenMP applications while skipping user-parallel code, which is a good thing!

    int main() {
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            b[j] = a[j];
        }
    }

After auto-parallelization and outlining, only the sequential loop is marked; the user-parallel loop is left alone:

    int main() {
        #auto-parallel-loop
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            b[j] = a[j];
        }
    }


Pre-parallelization phase

Loop Normalization (normalize countable loops)

Scalar privatization

Array privatization

Reduction variable analysis

Loop interchange (when it helps parallelization)
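For illustration, two loop patterns (examples mine, not from the slides) that the privatization and reduction analyses make parallelizable:

    /* Illustrative only. */
    void patterns(int *a, int *b, int n) {
        int t, sum = 0;

        /* scalar privatization: t is written before it is read in every
           iteration, so each thread can keep a private copy of t */
        for (int i = 0; i < n; i++) {
            t = a[i] * 2;
            b[i] = t + 1;
        }

        /* reduction variable analysis: sum is a (+) reduction; threads
           can compute partial sums and combine them at the end */
        for (int i = 0; i < n; i++)
            sum = sum + a[i];
        b[0] = sum;   /* keep sum live */
    }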


Cost Analysis

Automatic parallelization applies two tests:

– Dependence analysis: is it safe to parallelize?

– Cost analysis: is it worthwhile to parallelize?

Cost analysis estimates the total workload of the loop:

    LoopCost = IterationCount * ExecTimeOfLoopBody

When the cost is known at compile time, the decision is trivial.

Runtime cost analysis is more complex.
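For instance (illustrative snippets, not from the slides):

    void cost_examples(int *a, int n) {
        /* trip count is the constant 1000 (assumes a has >= 1000
           elements): LoopCost = 1000 * body cost is known at compile
           time, so the decision is made statically */
        for (int i = 0; i < 1000; i++)
            a[i] = 0;

        /* trip count depends on n: LoopCost = n * body cost must be
           evaluated by a runtime check, as shown on the next slide */
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }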


Conditional Parallelization

Original code:

    int main() {
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

Transformed code with the runtime check:

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            if (loop_cost > threshold) {
                _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                          @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
            } else
                main@OL@1(0, 0, (unsigned)n, 0)
            endif
        endif
        return main;
    }

Outlined routine (as before):

    Subroutine void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
    } while ((unsigned)@CIV1 < (@UB - @LB));
    return; }


Runtime cost analysis challenges

Runtime checks should be:

– Lightweight: they should not introduce large overhead in applications that are mostly serial.

– Overflow-safe: overflow leads to an incorrect decision, which is costly!

    loopcost = (((c1*n1) + (c2*n2) + const) * n3) * …

– Restricted to integer operations.

– Accurate.

All of the above factors must be balanced.
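A minimal sketch of one way to keep such an integer cost expression overflow-safe, using saturating arithmetic (an illustration of the idea, not the XL implementation):

    #include <stdint.h>

    /* multiply, saturating to UINT64_MAX instead of wrapping */
    static uint64_t sat_mul(uint64_t a, uint64_t b) {
        if (b != 0 && a > UINT64_MAX / b)
            return UINT64_MAX;
        return a * b;
    }

    /* add, saturating to UINT64_MAX instead of wrapping */
    static uint64_t sat_add(uint64_t a, uint64_t b) {
        return (a > UINT64_MAX - b) ? UINT64_MAX : a + b;
    }

    /* loopcost = (((c1*n1) + (c2*n2) + k) * n3), computed so that
       wrap-around can never flip the parallelization decision */
    uint64_t loop_cost(uint64_t c1, uint64_t n1, uint64_t c2,
                       uint64_t n2, uint64_t k, uint64_t n3) {
        uint64_t inner = sat_add(sat_add(sat_mul(c1, n1),
                                         sat_mul(c2, n2)), k);
        return sat_mul(inner, n3);
    }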


Runtime dependence test

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            if (<deptest> && loop_cost > threshold) {
                _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                          @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
            } else
                main@OL@1(0, 0, (unsigned)n, 0)
            endif
        endif
        return main;
    }

The original code and outlined routine are as before; the runtime dependence test <deptest> is combined with the cost check.

Work by Peng Zhao
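In source terms, a runtime dependence test can be as simple as an overlap check between possibly aliasing arrays (a hypothetical example; the actual test the compiler generates depends on the access patterns):

    #include <stdbool.h>

    /* <deptest> analogue: copying src into dst is dependence-free
       when the two regions do not overlap */
    static bool no_overlap(const double *dst, const double *src, int n) {
        return (dst + n <= src) || (src + n <= dst);
    }

    void copy(double *dst, const double *src, int n) {
        if (no_overlap(dst, src, n)) {
            for (int i = 0; i < n; i++)   /* parallel (outlined) version */
                dst[i] = src[i];
        } else {
            for (int i = 0; i < n; i++)   /* serial fallback */
                dst[i] = src[i];
        }
    }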


[Chart: SPEC2000FP auto-parallelization performance, % improvement with -O5 -qsmp, for swim, wupwise, mgrid, applu, lucas, mesa, art, equake, ammp, apsi, facerec, fma3d, and sixtrack. 1 processor: -0.5%; 2 processors: 8%.]


Controlled parallelization

Cost analysis selects big loops.

Controlled parallelization:

– Selection alone is not enough.

– Parallel performance depends on both the amount of work and the number of processors used.

– Using a large number of processors for a small loop causes huge degradations!


[Chart: galgel (SPEC OMPM2001) execution time in seconds vs. number of processors (8, 16, 32, 48, 64), measured on a 64-way Power5 system. Small is good!]


Controlled parallelization

Introduce another runtime parameter, IPT (minimum iterations per thread).

The IPT is passed to the SMP runtime.

The SMP runtime limits the number of threads working on the parallel loop based on IPT.

    IPT = function(loop_cost, memory access info, …)
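Purely as an illustration of one plausible shape for that function (hypothetical; the real heuristic also folds in memory access information):

    /* cheap loop bodies need more iterations per thread to amortize
       the parallelization overhead, so IPT shrinks as per-iteration
       cost grows; 'overhead' is a hypothetical tuning constant */
    unsigned compute_ipt(unsigned long body_cost, unsigned long overhead) {
        unsigned long ipt = overhead / (body_cost ? body_cost : 1);
        return ipt ? (unsigned)ipt : 1;   /* at least 1 iteration/thread */
    }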


Controlled Parallelization

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            if (loop_cost > threshold) {
                IPT = func(loop_cost)
                _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                          @_xlsmpEntry0, 0, 0, 0, 0, 0, IPT)
            } else
                main@OL@1(0, 0, (unsigned)n, 0)
            endif
        endif
        return main;
    }

The original code and outlined routine are as before; IPT is the new runtime parameter in the setup call.


SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, .., IPT)
    {
        threadsUsed = IterCount / IPT;
        if (threadsUsed > threadsAvailable)
            threadsUsed = threadsAvailable;
        .....
    }
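The rule above in runnable form (illustrative; with IterCount = 100 and IPT = 25 it yields at most 4 threads, even on a 64-way machine):

    unsigned threads_used(unsigned iter_count, unsigned ipt,
                          unsigned threads_available) {
        unsigned t = iter_count / ipt;   /* guarantee >= IPT iters/thread */
        if (t == 0) t = 1;               /* always at least one thread */
        if (t > threads_available)
            t = threads_available;
        return t;
    }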


Controlled parallelization for OpenMP

Improves performance and scalability.

Allows fine-grained control at loop-level granularity.

Can be applied to OpenMP loops as well.

The number of threads is adjusted only when the environment variable OMP_DYNAMIC is turned on.

There are issues with threadprivate data.

Encouraging results on galgel.
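For reference, dynamic thread adjustment can be enabled through the environment (OMP_DYNAMIC=TRUE) or through the standard OpenMP API, as in this small example:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        static double a[1000];
        omp_set_dynamic(1);   /* same effect as OMP_DYNAMIC=TRUE */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = i * 0.5;   /* runtime may limit the thread count here */
        printf("%f\n", a[999]);
        return 0;
    }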


[Chart: galgel (SPEC OMPM2001) execution time in seconds vs. number of processors (8, 16, 32, 48, 64), with and without controlled parallelization, measured on a 64-way Power5 system.]


Future work

Improve the cost analysis algorithm and fine-tune its heuristics.

Implement interprocedural cost analysis.

Extend cost analysis and controlled parallelization to non-loop constructs in user-parallel code, for scalability.

Implement interprocedural dependence analysis.
