
Compilation Technology

Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers

Priya Unnikrishnan, IBM Toronto Lab (priyau@ca.ibm.com)

CASCON 2005, October 2005

© 2005 IBM Corporation


Overview

Parallelization in IBM XL compilers

Outlining

Automatic parallelization

Cost analysis

Controlled parallelization

Future work


Parallelization

The IBM XL compilers support Fortran 77/90/95, C, and C++.

They implement both OpenMP and automatic parallelization.

Both target SMP (shared-memory parallel) machines.

Non-threadsafe code is generated by default.

– Use the _r invocations (xlf_r, xlc_r, …) to generate threadsafe code.


Parallelization options

-qsmp=noopt : Parallelizes code with minimal optimization, to allow for better debugging of OpenMP applications.

-qsmp=omp : Parallelizes code containing OpenMP directives.

-qsmp=auto : Automatically parallelizes loops.

-qsmp=noauto : No auto-parallelization; IBM and OpenMP parallel directives are still processed.
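For example, illustrative invocations combining these options with the threadsafe drivers (file names are placeholders):

    xlc_r -O5 -qsmp=auto myprog.c     # auto-parallelize loops
    xlf_r -O5 -qsmp=omp  myprog.f     # process OpenMP directives only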


Outlining

Parallelization transformation


Outlining

Original code:

    int main() {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

Runtime call:

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                      @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
        endif
        return main;
    }

Outlined routine:

    Subroutine void main@OL@1(unsigned @LB, unsigned @UB) {
        @CIV1 = 0;
        do {
            a[]0[(long)@LB + @CIV1] = const;
            ......
            @CIV1 = @CIV1 + 1;
        } while ((unsigned)@CIV1 < (@UB - @LB));
        return;
    }
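In plain C terms, outlining corresponds roughly to the following source-level transformation (an illustrative sketch with made-up names, not actual compiler output):

    /* Before outlining: a simple parallelizable loop. */
    void before(int *a, int n, int c) {
        for (int i = 0; i < n; i++)
            a[i] = c;
    }

    /* After outlining: the loop body is extracted into a routine
       parameterized by the inclusive chunk bounds it should execute. */
    static void outlined_chunk(int *a, int c, unsigned lb, unsigned ub) {
        for (unsigned i = lb; i <= ub; i++)
            a[i] = c;
    }

    void after(int *a, int n, int c) {
        /* the SMP runtime invokes outlined_chunk on several threads,
           each with a different [lb, ub] range of the iteration space */
        if (n > 0)
            outlined_chunk(a, c, 0, (unsigned)(n - 1));
    }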


SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, ..)

        main@OL@1(0,9)   main@OL@1(10,19)   main@OL@1(20,29)   main@OL@1(30,39)

The outlined function is parameterized: it can be invoked for different ranges of the iteration space.
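A minimal sketch of how a runtime can split the iteration space into chunks and invoke the parameterized routine (hypothetical code, run sequentially here for clarity; the real runtime dispatches each call to a different thread):

    #include <stdio.h>

    /* stand-in for the outlined routine main@OL@1 */
    static void outlined(unsigned lb, unsigned ub) {
        printf("iterations %u..%u\n", lb, ub);
    }

    int main(void) {
        unsigned n = 40, nthreads = 4;
        unsigned chunk = n / nthreads;
        for (unsigned t = 0; t < nthreads; t++) {
            unsigned lb = t * chunk;
            unsigned ub = (t == nthreads - 1) ? n - 1 : lb + chunk - 1;
            outlined(lb, ub);   /* 0..9, 10..19, 20..29, 30..39 */
        }
        return 0;
    }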


Auto-parallelization

Integrated framework for OpenMP and auto-parallelization

Auto-parallelization is restricted to loops.

Auto-parallelization is done in the link step when possible.

This allows various interprocedural analyses and optimizations to be performed before automatic parallelization.


Auto-parallelization transformation

    int main() {
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

After the dependence and cost tests pass, the loop is marked parallel:

    int main() {
        #auto-parallel-loop
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

Outlining then proceeds exactly as for OpenMP loops.


We can auto-parallelize OpenMP applications while skipping user-parallel code, which is a good thing!

    int main() {
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            b[j] = a[j];
        }
    }

After auto-parallelization and outlining, only the sequential loop is marked; the user-parallel loop is left alone:

    int main() {
        #auto-parallel-loop
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            b[j] = a[j];
        }
    }


Pre-parallelization phase

Loop Normalization (normalize countable loops)

Scalar privatization

Array privatization

Reduction variable analysis

Loop interchange (when it helps parallelization)
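For illustration, two loop patterns (examples mine, not from the slides) that the privatization and reduction analyses make parallelizable:

    /* Illustrative only. */
    void patterns(int *a, int *b, int n) {
        int t, sum = 0;

        /* scalar privatization: t is written before it is read in every
           iteration, so each thread can keep a private copy of t */
        for (int i = 0; i < n; i++) {
            t = a[i] * 2;
            b[i] = t + 1;
        }

        /* reduction variable analysis: sum is a (+) reduction; threads
           can compute partial sums and combine them at the end */
        for (int i = 0; i < n; i++)
            sum = sum + a[i];
        b[0] = sum;   /* keep sum live */
    }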


Cost Analysis

Automatic parallelization applies two tests:

– Dependence analysis: is it safe to parallelize?

– Cost analysis: is it worthwhile to parallelize?

Cost analysis estimates the total workload of the loop:

    LoopCost = IterationCount * ExecTimeOfLoopBody

When the cost is known at compile time, the decision is trivial.

Runtime cost analysis is more complex.
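For instance (illustrative snippets, not from the slides):

    void cost_examples(int *a, int n) {
        /* trip count is the constant 1000 (assumes a has >= 1000
           elements): LoopCost = 1000 * body cost is known at compile
           time, so the decision is made statically */
        for (int i = 0; i < 1000; i++)
            a[i] = 0;

        /* trip count depends on n: LoopCost = n * body cost must be
           evaluated by a runtime check, as shown on the next slide */
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }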


Conditional Parallelization

Original code:

    int main() {
        for (int i = 0; i < n; i++) {
            a[i] = const;
            ......
        }
    }

Transformed code with the runtime check:

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            if (loop_cost > threshold) {
                _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                          @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
            } else
                main@OL@1(0, 0, (unsigned)n, 0)
            endif
        endif
        return main;
    }

Outlined routine (as before):

    Subroutine void main@OL@1( ......
        @CIV1 = @CIV1 + 1;
    } while ((unsigned)@CIV1 < (@UB - @LB));
    return; }


Runtime cost analysis challenges

Runtime checks should be:

– Lightweight: they should not introduce large overhead in applications that are mostly serial.

– Overflow-safe: overflow leads to an incorrect decision, which is costly!

    loopcost = (((c1*n1) + (c2*n2) + const) * n3) * …

– Restricted to integer operations.

– Accurate.

All of the above factors must be balanced.
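A minimal sketch of one way to keep such an integer cost expression overflow-safe, using saturating arithmetic (an illustration of the idea, not the XL implementation):

    #include <stdint.h>

    /* multiply, saturating to UINT64_MAX instead of wrapping */
    static uint64_t sat_mul(uint64_t a, uint64_t b) {
        if (b != 0 && a > UINT64_MAX / b)
            return UINT64_MAX;
        return a * b;
    }

    /* add, saturating to UINT64_MAX instead of wrapping */
    static uint64_t sat_add(uint64_t a, uint64_t b) {
        return (a > UINT64_MAX - b) ? UINT64_MAX : a + b;
    }

    /* loopcost = (((c1*n1) + (c2*n2) + k) * n3), computed so that
       wrap-around can never flip the parallelization decision */
    uint64_t loop_cost(uint64_t c1, uint64_t n1, uint64_t c2,
                       uint64_t n2, uint64_t k, uint64_t n3) {
        uint64_t inner = sat_add(sat_add(sat_mul(c1, n1),
                                         sat_mul(c2, n2)), k);
        return sat_mul(inner, n3);
    }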


Runtime dependence test

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            if (<deptest> && loop_cost > threshold) {
                _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                          @_xlsmpEntry0, 0, 0, 0, 0, 0, 0)
            } else
                main@OL@1(0, 0, (unsigned)n, 0)
            endif
        endif
        return main;
    }

The original code and outlined routine are as before; the runtime dependence test <deptest> is combined with the cost check.

Work by Peng Zhao
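In source terms, a runtime dependence test can be as simple as an overlap check between possibly aliasing arrays (a hypothetical example; the actual test the compiler generates depends on the access patterns):

    #include <stdbool.h>

    /* <deptest> analogue: copying src into dst is dependence-free
       when the two regions do not overlap */
    static bool no_overlap(const double *dst, const double *src, int n) {
        return (dst + n <= src) || (src + n <= dst);
    }

    void copy(double *dst, const double *src, int n) {
        if (no_overlap(dst, src, n)) {
            for (int i = 0; i < n; i++)   /* parallel (outlined) version */
                dst[i] = src[i];
        } else {
            for (int i = 0; i < n; i++)   /* serial fallback */
                dst[i] = src[i];
        }
    }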


[Chart: SPEC2000FP auto-parallelization performance, % improvement with -O5 -qsmp, for swim, wupwise, mgrid, applu, lucas, mesa, art, equake, ammp, apsi, facerec, fma3d, and sixtrack. 1 processor: -0.5%; 2 processors: 8%.]


Controlled parallelization

Cost analysis selects big loops.

Controlled parallelization:

– Selection alone is not enough.

– Parallel performance depends on both the amount of work and the number of processors used.

– Using a large number of processors for a small loop causes huge degradations!


[Chart: galgel (SPEC OMPM2001) execution time in seconds vs. number of processors (8, 16, 32, 48, 64), measured on a 64-way Power5 system. Small is good!]


Controlled parallelization

Introduce another runtime parameter, IPT (minimum iterations per thread).

The IPT is passed to the SMP runtime.

The SMP runtime limits the number of threads working on the parallel loop based on IPT.

    IPT = function(loop_cost, memory access info, …)
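Purely as an illustration of one plausible shape for that function (hypothetical; the real heuristic also folds in memory access information):

    /* cheap loop bodies need more iterations per thread to amortize
       the parallelization overhead, so IPT shrinks as per-iteration
       cost grows; 'overhead' is a hypothetical tuning constant */
    unsigned compute_ipt(unsigned long body_cost, unsigned long overhead) {
        unsigned long ipt = overhead / (body_cost ? body_cost : 1);
        return ipt ? (unsigned)ipt : 1;   /* at least 1 iteration/thread */
    }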


Controlled Parallelization

    long main() {
        @_xlsmpEntry0 = _xlsmpInitializeRTE();
        if (n > 0) then
            if (loop_cost > threshold) {
                IPT = func(loop_cost)
                _xlsmpParallelDoSetup_TPO(2208, &main@OL@1, 0, n, 5, 0,
                                          @_xlsmpEntry0, 0, 0, 0, 0, 0, IPT)
            } else
                main@OL@1(0, 0, (unsigned)n, 0)
            endif
        endif
        return main;
    }

The original code and outlined routine are as before; IPT is the new runtime parameter in the setup call.


SMP parallel runtime

    _xlsmpParallelDoSetup_TPO(&main@OL@1, 0, n, .., IPT)
    {
        threadsUsed = IterCount / IPT;
        if (threadsUsed > threadsAvailable)
            threadsUsed = threadsAvailable;
        .....
    }
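The rule above in runnable form (illustrative; with IterCount = 100 and IPT = 25 it yields at most 4 threads, even on a 64-way machine):

    unsigned threads_used(unsigned iter_count, unsigned ipt,
                          unsigned threads_available) {
        unsigned t = iter_count / ipt;   /* guarantee >= IPT iters/thread */
        if (t == 0) t = 1;               /* always at least one thread */
        if (t > threads_available)
            t = threads_available;
        return t;
    }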


Controlled parallelization for OpenMP

Improves performance and scalability.

Allows fine-grained control at loop-level granularity.

Can be applied to OpenMP loops as well.

The number of threads is adjusted only when the environment variable OMP_DYNAMIC is turned on.

There are issues with threadprivate data.

Encouraging results on galgel.
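For reference, dynamic thread adjustment can be enabled through the environment (OMP_DYNAMIC=TRUE) or through the standard OpenMP API, as in this small example:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        static double a[1000];
        omp_set_dynamic(1);   /* same effect as OMP_DYNAMIC=TRUE */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = i * 0.5;   /* runtime may limit the thread count here */
        printf("%f\n", a[999]);
        return 0;
    }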


[Chart: galgel (SPEC OMPM2001) execution time in seconds vs. number of processors (8, 16, 32, 48, 64), with and without controlled parallelization, measured on a 64-way Power5 system.]


Future work

Improve the cost analysis algorithm and fine-tune its heuristics.

Implement interprocedural cost analysis.

Extend cost analysis and controlled parallelization to non-loop constructs in user-parallel code, for scalability.

Implement interprocedural dependence analysis.
