
Auto-Parallelizing Option


Page 1: Auto-Parallelizing Option


Auto-Parallelizing Option

John Matrow, M.S.

System Administrator/Trainer

Page 2: Auto-Parallelizing Option


Outline

Compiler
* Options
* Output

Incomplete Optimization
* Does not detect a loop is safe to parallelize
* Parallelizes the wrong loop
* Unnecessarily parallelizes a loop

Strategies for Assisting APO

Page 3: Auto-Parallelizing Option


Auto-Parallelizing Option (APO)

The MIPSpro Auto-Parallelizing Option (APO) from SGI is used to automatically detect and exploit parallelism in Fortran 77, Fortran 90, C and C++ programs.

Page 4: Auto-Parallelizing Option


SGI MIPSpro compilers

APO
IPA (interprocedural analysis)
LNO (loop nest optimization)

Page 5: Auto-Parallelizing Option


Syntax

f77/cc: -apo[{list|keep}] [-mplist] [-On]

f90/CC: -apo[{list|keep}] [-On]

Page 6: Auto-Parallelizing Option


Syntax

-apo list: produce a .l file, a listing of those parts of the program that can run in parallel and those that cannot

-apo keep: produce .l, .w2c.c, .m and .anl files. Do not use with -mplist

-mplist: generate the equivalent program for f77 in a .w2f.f file or for C in a .w2c.c file

-On: optimization level; 3 = aggressive (recommended)
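
For example, a typical compile line (hypothetical file name) might be:

   f77 -apo list -O3 -c myprog.f

This compiles myprog.f at the recommended optimization level and writes the parallelization listing file described above.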

Page 7: Auto-Parallelizing Option


Link

If you link separately, you must have one of the following in the command line:

The -apo flag
The -mp option
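
A sketch of separate compile and link steps (hypothetical file names); the -apo flag on the link line pulls in the parallel runtime support:

   f77 -apo -O3 -c main.f
   f77 -apo -O3 -c sub.f
   f77 -apo -O3 -o prog main.o sub.o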

Page 8: Auto-Parallelizing Option


Interprocedural Analysis (IPA)

Procedure inlining
Identification of global constants
Dead function elimination
Dead variable elimination
Dead call elimination
Interprocedural alias analysis
Interprocedural constant propagation

Page 9: Auto-Parallelizing Option


Loop Nest Optimization (LNO)

Loop interchange
Loop fusion
Loop fission
Cache blocking and outer loop unrolling

LNO runs when you use the -O3 option
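
As a sketch of what loop interchange buys (hypothetical array A): Fortran stores arrays in column-major order, so LNO may interchange the loops below to put the unit-stride J loop innermost:

C     before: the inner I loop strides through memory by N
      DO J = 1, N
         DO I = 1, N
            A(J, I) = 0.0
         END DO
      END DO
C     after interchange: the inner J loop touches consecutive elements
      DO I = 1, N
         DO J = 1, N
            A(J, I) = 0.0
         END DO
      END DO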

Page 10: Auto-Parallelizing Option


Sample source

SUBROUTINE sub(arr, n)
   REAL*8 arr(n)
   DO i = 1, n
      arr(i) = arr(i) + arr(i-1)
   END DO
   DO i = 1, n
      arr(i) = arr(i) + 7.0
      CALL foo(a)
   END DO
   DO i = 1, n
      arr(i) = arr(i) + 7.0
   END DO
END

Page 11: Auto-Parallelizing Option


Sample APO listing

Parallelization log for Subprogram sub_
3: Not Parallel
   Array dependence from arr on line 4 to arr on line 4.
6: Not Parallel
   Call foo on line 8
10: PARALLEL (Auto) _mpdo_sub_1

Page 12: Auto-Parallelizing Option


Sample source listing

C PARALLEL DO will be converted to SUBROUTINE _mpdo_sub_1
C$OMP PARALLEL DO private(i), shared(a)
      DO i = 1, 10000, 1
         a(i) = 0.0
      END DO

Page 13: Auto-Parallelizing Option


Running Your Program

Environment variable used to specify the number of threads: OMP_NUM_THREADS

Example: setenv OMP_NUM_THREADS 4

Page 14: Auto-Parallelizing Option


Running Your Program

Environment variable used to allow a dynamic number of threads to be used (as available): OMP_DYNAMIC

Example: setenv OMP_DYNAMIC FALSE

Default: TRUE

Page 15: Auto-Parallelizing Option


Incomplete Optimization

Does not detect a loop is safe to parallelize
Parallelizes the wrong loop
Unnecessarily parallelizes a loop

Page 16: Auto-Parallelizing Option


Failing to Parallelize Safe Loops

Does NOT parallelize loops containing:

Data dependencies*
Function calls
GO TO statements*
Problematic array subscripts
Conditionally assigned temporary nonlocal variables
Unanalyzable pointer usage (C/C++)

*not discussed here

Page 17: Auto-Parallelizing Option


Function Calls

You can tell APO to ignore dependencies of function calls by using:

Fortran: C*$* ASSERT CONCURRENT CALL

C/C++: #pragma concurrent call
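
A minimal Fortran sketch (hypothetical subroutine foo that the programmer knows has no cross-iteration side effects):

C*$* ASSERT CONCURRENT CALL
      DO i = 1, n
C        safe only if foo never touches data shared across iterations
         CALL foo(a, i)
      END DO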

Page 18: Auto-Parallelizing Option


Problematic Array Subscripts

Too complicated:

Indirect array references: A(IB(I)) = . . .

Unanalyzable subscripts (allowable elements: literal constants, variables, products, sums, and differences)

Subscripts that rely on hidden knowledge: A(I) = A(I+M)

Page 19: Auto-Parallelizing Option


Conditionally Assigned Temporary Nonlocal Variables

SUBROUTINE S1(A,B)
COMMON T
DO I = 1, N
   IF (B(I)) THEN
      T = . . .
      A(I) = A(I) + T
   END IF
END DO
CALL S2()
END

Page 20: Auto-Parallelizing Option


Unanalyzable Pointer Usage (C/C++)

Arbitrary pointer dereferences

Arrays of arrays: use p[n][n] instead of **p

Loops bounded by pointer comparisons

Aliased parameter information: use the __restrict type qualifier to say arrays do not overlap

Page 21: Auto-Parallelizing Option


Parallelizing the Wrong Loop

Inner loops
Small trip counts
Poor data locality

Page 22: Auto-Parallelizing Option


Inner Loops

APO tries to parallelize the outermost loop, after possibly interchanging loops to make a more promising one outermost

If the outermost loop attempt fails, APO parallelizes an inner loop if possible

When APO parallelizes an inner loop, it is usually because the outer loop failed for one of the reasons discussed under "Failing to Parallelize Safe Loops"

It is probably advantageous to modify the code so that the outermost loop is the one parallelized

Page 23: Auto-Parallelizing Option


Small Trip Counts

Loops with small trip counts generally run faster when they are not parallelized

Use an assertion:

Fortran: C*$* ASSERT DO PREFER
C/C++: #pragma prefer

Use manual parallelization directives
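
A minimal sketch of the Fortran form (hypothetical loop; the SERIAL variant of the directive is detailed on a later slide):

C*$* ASSERT DO PREFER (SERIAL)
      DO i = 1, 4
C        trip count of 4: thread startup would cost more than the work
         a(i) = a(i) + b(i)
      END DO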

Page 24: Auto-Parallelizing Option


Poor Data Locality

DO I = 1, N
   . . . A(I) . . .
END DO
DO I = N, 1, -1
   . . . A(I) . . .
END DO

Page 25: Auto-Parallelizing Option


Poor Data Locality

DO I = 1, N
   DO J = 1, N
      A(I,J) = B(J,I) + . . .
   END DO
END DO

DO I = 1, N
   DO J = 1, N
      B(I,J) = A(J,I) + . . .
   END DO
END DO

Page 26: Auto-Parallelizing Option


Incurring Unnecessary Parallelization Overhead

Unknown trip counts
Nested parallelism

Page 27: Auto-Parallelizing Option


Unknown Trip Counts

If the trip count is not known (and sometimes even if it is), APO parallelizes the loop conditionally

It generates code for both a parallel and a sequential version

This lets APO avoid running in parallel if the loop turns out to have a small trip count

The runtime choice also considers the number of processors available, the parallelization overhead, and the amount of work inside the loop
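
Conceptually, the conditional version resembles this hand-written sketch (hypothetical threshold; the actual test APO generates also weighs thread count and per-iteration cost, as the Example 1 output later shows):

C$OMP PARALLEL DO if(n .GT. 1000), private(i), shared(a)
      DO i = 1, n
         a(i) = 0.0
      END DO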

Page 28: Auto-Parallelizing Option


Nested Parallelism

SUBROUTINE CALLER
   DO I = 1, N
      CALL SUB
   END DO
END

SUBROUTINE SUB
   DO I = 1, N
      . . .
   END DO
END

Parallelizing both the loop in CALLER and the loop in SUB is the nested-parallelism case flagged above: the inner parallelization adds overhead without additional benefit.

Page 29: Auto-Parallelizing Option


Strategies for Assisting APO

Modify code to avoid coding practices that will not analyze well

Manual parallelization options [OpenMP]

Use APO directives to give APO more information about the code

Page 30: Auto-Parallelizing Option


Compiler Directives for Automatic Parallelization

C*$* [NO] CONCURRENTIZE
C*$* ASSERT DO (CONCURRENT|SERIAL)
C*$* ASSERT CONCURRENT CALL
C*$* ASSERT PERMUTATION (array_name)
C*$* ASSERT DO PREFER (CONCURRENT|SERIAL)

Page 31: Auto-Parallelizing Option


Compiler Directives

The following affect compilation even if -apo is not specified:

C*$* ASSERT DO (CONCURRENT)
C*$* ASSERT CONCURRENT CALL
C*$* ASSERT PERMUTATION

-LNO:ignore_pragmas causes APO to ignore all directives, assertions and pragmas

Page 32: Auto-Parallelizing Option


C*$* NO CONCURRENTIZE

Placed inside a subroutine, it affects only that subroutine

Placed outside a subroutine, it affects all subroutines

C*$* CONCURRENTIZE can be used to override a C*$* NO CONCURRENTIZE placed outside of it

Page 33: Auto-Parallelizing Option


C*$* ASSERT DO (CONCURRENT)

Tells APO to ignore array dependencies

Applying it to an inner loop may cause that loop to be made outermost by loop interchange

Does not affect CALL statements

Ignored if obvious real dependencies are found

If multiple loops can be parallelized, it causes APO to prefer the loop immediately following the assertion
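
A minimal sketch (hypothetical arrays) matching the A(I) = A(I+M) "hidden knowledge" case from the Problematic Array Subscripts slide:

C*$* ASSERT DO (CONCURRENT)
      DO i = 1, n
C        the programmer knows m >= n, so a(i) and a(i+m) never overlap
         a(i) = a(i+m)
      END DO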

Page 34: Auto-Parallelizing Option


C*$* ASSERT DO (SERIAL)

Do not parallelize the loop following the assertion

APO may parallelize another loop in the same nest

The parallelized loop may be either inside or outside the designated sequential loop

Page 35: Auto-Parallelizing Option


C*$* ASSERT CONCURRENT CALL

Applies to the loop that immediately follows it and to all loops nested inside that loop

A subroutine inside the loop cannot read from a location that is written to during another iteration (shared)

A subroutine inside the loop cannot write to a location that is read from or written to during another iteration (shared)

Page 36: Auto-Parallelizing Option


C*$* ASSERT PERMUTATION

C*$* ASSERT PERMUTATION (array_name) tells APO that array_name is a permutation array: every element of the array has a distinct value

The array can thus be used for indirect addressing

Affects every loop in the subroutine, even those appearing ahead of it
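
A minimal sketch (hypothetical arrays): asserting that ib is a permutation makes the indirect update from the Problematic Array Subscripts slide safe to parallelize:

C*$* ASSERT PERMUTATION (ib)
      DO i = 1, n
C        no two iterations touch the same element of a
         a(ib(i)) = a(ib(i)) + b(i)
      END DO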

Page 37: Auto-Parallelizing Option


C*$* ASSERT DO PREFER

C*$* ASSERT DO PREFER (CONCURRENT) instructs APO to parallelize the following loop if it is safe to do so

With nested loops, if it is not safe, APO uses heuristics to choose among loops that are safe

If applied to an inner loop, APO may make it the outermost loop

If applied to multiple loops, APO uses heuristics to choose one of the specified loops

Page 38: Auto-Parallelizing Option


C*$* ASSERT DO PREFER

C*$* ASSERT DO PREFER (SERIAL) is essentially the same as C*$* ASSERT DO (SERIAL)

Used in cases with small trip counts

Used in cases with poor data locality

Page 39: Auto-Parallelizing Option


Example 1: AddOpac.f

do nd=1,ndust
   if( lgDustOn1(nd) ) then
      do i=1,nupper
         dstab(i) = dstab(i) + dstab1(i,nd) * dstab3(nd)
         dstsc(i) = dstsc(i) + dstsc1(i,nd) * dstsc2(nd)
      end do
   endif
end do

408: Not Parallel
   Array dependence from DSTAB on line 412 to DSTAB on line 412.
   Array dependence from DSTSC on line 413 to DSTSC on line 413.

Page 40: Auto-Parallelizing Option


Example 1: AddOpac.f

C*$* ASSERT DO (CONCURRENT) before the outer DO resulted in:

DO ND = 1, 20, 1
   IF(LGDUSTON3(ND)) THEN
C PARALLEL DO will be converted to SUBROUTINE __mpdo_addopac_10
C$OMP PARALLEL DO if(((DBLE(__mp_sug_numthreads_func$()) *((DBLE(
C$&   __mp_sug_numthreads_func$()) * 1.23D+02) + 2.6D+03)) .LT.((DBLE(
C$&   NUPPER0) * DBLE((__mp_sug_numthreads_func$() + -1))) * 6.0D00))),
C$&   private(I6), shared(DSTAB2, DSTABUND0, DSTAB3, DSTSC2, DSTSC3, ND,
C$&   NUPPER0)
      DO I6 = 1, NUPPER0, 1
         DSTAB2(I6) = (DSTAB2(I6) +(DSTABUND0(ND) * DSTAB3(I6, ND)))
         DSTSC2(I6) = (DSTSC2(I6) +(DSTABUND0(ND) * DSTSC3(I6, ND)))
      END DO
   ENDIF
END DO

Page 41: Auto-Parallelizing Option


Example 2: BiDiag.f

135: Not Parallel
   Array dependence from DESTROY on line 166 to DESTROY on line 137.
   Array dependence from DESTROY on line 166 to DESTROY on line 144.
   Array dependence from DESTROY on line 174 to DESTROY on line 166.
   Array dependence from DESTROY on line 166 to DESTROY on line 166.
   Array dependence from DESTROY on line 144 to DESTROY on line 166.
   Array dependence from DESTROY on line 137 to DESTROY on line 166.
   Array dependence from DESTROY on line 166 to DESTROY on line 174.
   <more of same>

Page 42: Auto-Parallelizing Option


Example 2: BiDiag.f

C$OMP PARALLEL DO PRIVATE(ns, nej, nelec, max, ratio)
      do i=IonLow(nelem),IonHigh(nelem)-1
         . . .
C$OMP CRITICAL
         destroy(nelem,max) = destroy(nelem,max) +
     1      PhotoRate(nelem,i,ns,1) * vyield(nelem,i,ns,nej) * ratio
C$OMP END CRITICAL

Page 43: Auto-Parallelizing Option


Example 3: ContRate.f

78: Not Parallel
   Scalar dependence on XMAXSUB.
   Scalar XMAXSUB without unique last value.
   Scalar FREQSUB without unique last value.
   Scalar OPACSUB without unique last value.

Solution: same as previous example

Page 44: Auto-Parallelizing Option


Exercises

Copy ~jmatrow/openmp/apo*.f
Compile and examine the .list file
Each program requires one change

apo1.f: assertion needed
apo2.f: OpenMP directive needed