The Centre for Australian Weather and Climate Research A partnership between CSIRO and the Bureau of Meteorology Optimisation progress for UM7.8 Ilia Bermous

The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology

Optimisation progress for UM7.8

Ilia Bermous, 7 April 2011

Thanks to Joerg Henrichs, Martin Dix and Mike Naughton for their help and advice during this work


Description of global forecast test job

Global model with N320L70 resolution

Based on Fabrizio’s xazje job (which came from forecast step of Chris’s APS1 ACCESS-G development suite)

24 hour integration with ~30GB output

~3100-3250 GCR iterations per run (3162 with 7.5 and 3223 with 7.8) for 120 time steps.

Timing results are given in terms of Elapsed CPU Time and Elapsed Wallclock Time from internal UM model timers as reported in UM job output for each run.

All runs used Mike’s version of the UM run script, which is simple and flexible for this kind of task.


UM7.5 best performance results on Solar

Software used:

- Intel11.0.083 compiler

- OpenMPI mpi/sun-8.2 library

UM model environment settings and source changes:

- Joerg’s byte swapping procedure

- Q_POS_METHOD=5 for the improved QPOS algorithms

- Lustre file system striping

Best elapsed times (in sec) with a 20x24 decomposition => 480 cores (i.e. under 500 cores):

Full I/O: (475; 497) (484; 513) (492; 519)

No I/O: (330; 335) (306; 311) (310; 315)


Major UM7.8 developments for performance improvement


Asynchronous parallel I/O (requires OpenMP)

- This new feature is activated at both the build and run stages

- Only works with OpenMP: the UMUI “Use OpenMP” option in the “User Information and Submit Method => Job submission method” panel must be selected

PMSL revised algorithm (Jacobi algorithm)

- A revised algorithm based on a Jacobi solver is introduced and the number of iterations is increased. Even with more iterations this new method is cheaper and scales at higher node counts.

Optimisation for FILL_EXTERNAL_HALOS

- Resulted in a ~5% reduction in the runtime cost of the fill_external_halos routine

Improved QPOS algorithms

- All new versions are significantly quicker at scale (~10%) but require scientific validation as the science is altered somewhat. The "level" method has been validated in PS25 and is now being used in the global model at the Met Office.


Summary of attempts made for UM7.8

OpenMP usage with Intel compiler and 1 thread on Solar

OpenMP usage with Intel compiler and 2 threads on Solar

OpenMP usage with SunStudio compiler and 2 threads on Solar

OpenMP usage with Intel compiler and 2 threads on NCI system

OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar




Problems resolved and reported

- Found a number of cases in the sources of inconsistent usage of allocate/deallocate statements and IF block logic => reported to the UM developers

- Significant impact on performance if TMPDIR is used and modified in the UM scripts => reported to the developers

- A couple of missing environment variables, such as OMP_NUM_THREADS and OMP_STACKSIZE, should be set by the UMUI scripts if multithreading is used

- A run time crash with the Intel11.0.083 compiler was resolved by using Intel11.1.073, the latest compiler available on our site
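As a minimal sketch of the missing environment settings, the Python helper below builds the environment a launch script could pass to the model when multithreading is enabled. The helper name `openmp_env` and the default stack size `1G` are illustrative assumptions; they are not taken from the UM or UMUI scripts.

```python
import os


def openmp_env(num_threads, stack_size="1G", base_env=None):
    """Return a copy of the environment with the OpenMP variables set.

    stack_size is a placeholder value (assumption); tune it per site.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    if num_threads > 1:
        # OMP_STACKSIZE sets the stack size of threads created by OpenMP,
        # so it is only set here when extra threads will exist.
        env["OMP_STACKSIZE"] = stack_size
    return env


if __name__ == "__main__":
    env = openmp_env(2)
    print(env["OMP_NUM_THREADS"], env["OMP_STACKSIZE"])
```

The returned dictionary can be handed to `subprocess.run(..., env=env)` so that the settings apply only to the launched job.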


OpenMP usage with Intel compiler and 1 thread on Solar (cont #2)


Performance results (20x24x1, Lustre striping, FLUME_IOS_NPROC=8), elapsed times in sec

UM7.8                     UM7.5 (without multithreading)
Full I/O    No I/O        Full I/O    No I/O
370; 389    221; 225      484; 512    330; 335
362; 389    205; 210      475; 497    306; 311
389; 416    209; 212      492; 519    310; 315
            228; 235                  277; 285

Conclusions:

1. The full I/O case with UM7.8 runs over 20% faster than with UM7.5

2. UM7.8 without usage of I/O runs ~1.5 times faster than with UM7.5

3. There is no visible performance improvement in the I/O part: 179 sec vs 186 sec

4. The results in red are from the runs used to compare performance section by section
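The arithmetic behind conclusions 1 to 3 can be sketched in Python. This assumes each pair is (elapsed CPU; elapsed wallclock), as described for the test job, and that the best wallclock time of each column is the one compared; the slides do not state which runs were paired, but this reading reproduces the quoted 179 sec and 186 sec.

```python
# (CPU, wallclock) elapsed times in seconds for the first three runs
# of each column of the results table.
um78_full = [(370, 389), (362, 389), (389, 416)]
um78_noio = [(221, 225), (205, 210), (209, 212)]
um75_full = [(484, 512), (475, 497), (492, 519)]
um75_noio = [(330, 335), (306, 311), (310, 315)]


def best_wallclock(runs):
    """Smallest wallclock time among the runs."""
    return min(wall for _cpu, wall in runs)


# Ratio of UM7.5 time to UM7.8 time (values above 1 mean UM7.8 is faster).
full_speedup = best_wallclock(um75_full) / best_wallclock(um78_full)
noio_speedup = best_wallclock(um75_noio) / best_wallclock(um78_noio)

# Cost of the I/O part, taken as full I/O time minus no I/O time.
io_cost_78 = best_wallclock(um78_full) - best_wallclock(um78_noio)
io_cost_75 = best_wallclock(um75_full) - best_wallclock(um75_noio)

print(f"full I/O speedup: {full_speedup:.2f}")
print(f"no I/O speedup:   {noio_speedup:.2f}")
print(io_cost_78, io_cost_75)  # 179 186
```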




Performance comparison between top 6 sections for UM7.8 and UM7.5 without usage of I/O (times in sec)

UM7.8                      UM7.5
PE_Helmholtz     68.15     PE_Helmholtz     70.76
SL_Full_wind     31.35     ATM_STEP         52.30
ATM_STEP         26.46     SL_Full_wind     31.19
SL_Thermo        21.52     SL_Thermo        27.20
READDUMP         13.53     READDUMP         12.67
Atmos_Physics2    9.55     NI_filter_Ctl    15.63

Conclusions:

1. Comparing the top sections, the major performance improvements come from ATM_STEP (25.74 sec), NI_filter_Ctl (9.59 sec) and SL_Thermo (5.68 sec), which gives 41.01 sec in total


UM7.8 performance comparison: full I/O, 20x24 decomposition and Lustre striping

EXE1 – Intel11.1.073, -openmp, 1 thread, FLUME_IOS_NPROC=8, buffer_size=6000 (results from the previous slide)

EXE2 – UMUI standard building procedure using Intel11.0.083 with the “safe” level of optimisation (UMUI build job xbauk)

EXE3 – Intel11.1.073, bld.cfg is based on Imtiaz’s version without “-WB -warn all -warn nointerfaces -align all” (due to Intel compiler problems)

Elapsed times (in sec):

EXE1        EXE2        EXE3
370; 389    381; 411    370; 396
362; 389    362; 395    369; 397
389; 416    366; 392    382; 406

Conclusion:

1. Usage of OpenMP with a single thread and parallel I/O with FLUME_IOS_NPROC=8 does NOT provide any performance advantage




The same slow performance issue that was found and investigated in detail for UM7.5, and reported in August 2010, still exists; Intel has not addressed it at all

- Monitoring execution of the model: sometimes the job runs fast for the first 10-15 steps and then slows down significantly; sometimes the slowdown happens from the start of a run

- Elapsed times for a 14x18 decomposition with 2 threads per MPI process (504 cores) and without I/O are (4915 sec; 4921 sec), in comparison with (360 sec; 366 sec) using 14x18x1 without I/O

Conclusions:

- This long-standing problem must be addressed by Intel

- At the moment it is not the most critical issue in getting asynchronous parallel I/O functionality with UM7.8




Problems resolved and reported

- Usage of POINTER INTENT attributes, which is not supported by the Fortran standard => used a workaround and reported the problem to the UM development team

Multithreading performance results

- UM performance using 2 threads per MPI process is better than without OpenMP, but scaling is very poor (Lustre striping was not used):

14x18x2 threads + FLUME_IOS_NPROC=4 => 508 cores => 753 sec

14x18x1 thread + FLUME_IOS_NPROC=0 => 252 cores => 794 sec

Note:

- Date command output was used to calculate elapsed times

- Several runs using different run configurations, such as 20x24x1 and 16x32x1, crashed; the nature of these problems has not been investigated

- Usage of different optimisation options such as -O3, -O5, -xtarget=native, -xarch=native, -dalign, -g does not make any visible impact on the performance results
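The core counts quoted for the two runs, and how poor the scaling is, can be checked with a short Python sketch. It assumes each atmosphere MPI task occupies one core per thread and each IO server task occupies one core; the function name `cores_requested` is hypothetical.

```python
def cores_requested(ew, ns, threads, ios_nproc):
    """Cores for an ew x ns decomposition plus IO server tasks.

    Assumption: `threads` cores per atmosphere task, one core per IO server.
    """
    return ew * ns * threads + ios_nproc


cores_2t = cores_requested(14, 18, threads=2, ios_nproc=4)
cores_1t = cores_requested(14, 18, threads=1, ios_nproc=0)
print(cores_2t, cores_1t)  # 508 252

# Elapsed times in seconds quoted for the two runs.
speedup = 794 / 753
core_ratio = cores_2t / cores_1t
# Fraction of the ideal speedup actually obtained from the extra cores.
efficiency = speedup / core_ratio
print(f"speedup {speedup:.2f} on {core_ratio:.2f}x the cores "
      f"(efficiency {efficiency:.2f})")
```

Roughly twice the cores buy only a few percent in elapsed time, which is the sense in which the scaling is very poor.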




The same slow performance issue as on Solar exists on the NCI system using the Intel11.1.073 compiler and the openmpi1.4.3 library: (3565 sec; 3703 sec)

- Monitoring execution of the model: the job ran slowly from the first time step

Usage of the latest Intel12.0.084 compiler

- Compilation crashes for one file; the workaround recommended by Martin, using “-O0” instead of “-O2”, fixes the problem

- Execution with a 14x18 decomposition using a single thread crashes; this run time problem has not been investigated




Due to the slow performance issue with multithreading in the computational part, the main idea of this approach is

- to compile all UM7.8 sources excluding the “io_services” library without the “-openmp” compilation option

- to compile the “io_services” library with multithreading using the “-openmp” compilation option

UM7.8 major terms in relation to asynchronous parallel I/O:

- FLUME_IOS_NPROC – number of MPI tasks allocated to act as IO servers

- IOS_Spacing – the gap between IO servers in MPI_COMM_WORLD (for optimal performance a node has no more than one IO server)

- buffer_size – amount of data (MB) that each IO server can have outstanding

- IOS_use_async_stash – use asynchronous communications to accelerate diagnostics output

- IOS_use_async_dump – asynchronous DUMP output not currently available
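To illustrate how FLUME_IOS_NPROC and IOS_Spacing interact with the number of MPI processes per node, here is a hedged Python sketch. The placement rule (first IO server at rank 0, the rest every `ios_spacing` ranks) and the block mapping of ranks to nodes are assumptions made for illustration only; the actual UM7.8 placement logic is not described here.

```python
def io_server_ranks(ios_nproc, ios_spacing):
    """Ranks of the IO servers in MPI_COMM_WORLD.

    Assumption: the first server sits at rank 0 and the others follow
    every `ios_spacing` ranks.
    """
    return [i * ios_spacing for i in range(ios_nproc)]


def servers_per_node(ranks, procs_per_node):
    """Count IO servers on each node, assuming ranks fill nodes in blocks."""
    counts = {}
    for rank in ranks:
        node = rank // procs_per_node
        counts[node] = counts.get(node, 0) + 1
    return counts


ranks = io_server_ranks(ios_nproc=8, ios_spacing=8)
counts = servers_per_node(ranks, procs_per_node=8)
print(ranks)
print("at most one IO server per node:", max(counts.values()) <= 1)
```

Under these assumptions, a spacing equal to the number of MPI processes per node places exactly one IO server on each of the first eight nodes, while a smaller spacing would put several servers on the same node.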


OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #2)


Problems/issues found

- Using the 14x18x2 configuration with FLUME_IOS_NPROC=8 and a spacing of 8 (8 nodes are overcommitted), a run time error was produced:

forrtl: severe (40): recursive I/O operation, unit 6, file unknown

Workaround: several write statements producing similar diagnostic output have been commented out (as per Joerg’s message, Peter Kerney has reported this problem to Intel Support)

- The main asynchronous parallel I/O functionality from the latest model development is available only if the MPI library allows multiple threads to call MPI with no restrictions (MPI_THREAD_MULTIPLE). Unfortunately our MPI library (OpenMPI) provides only single-threaded support (MPI_THREAD_SINGLE). This is checked by an MPI_QUERY_THREAD call, which returns the current level of thread support

Comment: this is another example of an obstacle that arises when the user has a different platform from the platform used by the developer
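The thread support check can be modelled in a few lines of Python. This is a sketch of the logic only, not the UM7.8 code: the MPI standard guarantees the ordering SINGLE < FUNNELED < SERIALIZED < MULTIPLE, but the integer values below are illustrative, and the names `ThreadLevel` and `async_io_available` are hypothetical.

```python
from enum import IntEnum


class ThreadLevel(IntEnum):
    """MPI thread support levels in increasing order of support.

    Only the ordering is guaranteed by the MPI standard; the numeric
    values here are illustrative.
    """
    SINGLE = 0      # only one thread executes
    FUNNELED = 1    # multi-threaded, only the main thread calls MPI
    SERIALIZED = 2  # any thread may call MPI, one at a time
    MULTIPLE = 3    # any thread may call MPI with no restrictions


def async_io_available(provided, required=ThreadLevel.MULTIPLE):
    """True if the level reported by the library meets the requirement."""
    return provided >= required


for level in ThreadLevel:
    print(level.name, async_io_available(level))
```

With a library that reports only the SINGLE level, the check fails and the full asynchronous path is not available.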


With the current version of the MPI library, to be able to use some parts of the functionality implemented in UM7.8, Joerg suggested overriding the UM7.8 setting of MPI_THREAD_SINGLE with MPI_THREAD_FUNNELED (the task can be multi-threaded, but only the main thread makes MPI calls; all MPI calls are funneled to the main thread)

Results (in sec) using a 20x24 decomposition with Lustre file system striping (4 MB, 8 ways), buffer_size=6000

488 cores (FLUME_IOS_NPROC=8),    560 cores (FLUME_IOS_NPROC=10),
8 MPI processes per node          7 MPI processes per node
431; 436                          390; 395
425; 429                          404; 413
423; 428                          406; 414
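The two core counts in the table follow from the number of MPI tasks and the number of tasks placed on each node. The Python sketch below assumes 8 cores per node, which is inferred from the "8 MPI processes per node" configuration and not stated explicitly; the function name `nodes_and_cores` is hypothetical.

```python
def nodes_and_cores(mpi_tasks, tasks_per_node, cores_per_node=8):
    """Whole nodes needed for the tasks and the cores those nodes contain.

    cores_per_node=8 is an assumption inferred from the configurations.
    """
    # Integer ceiling division: a partly filled node is still a whole node.
    nodes = (mpi_tasks + tasks_per_node - 1) // tasks_per_node
    return nodes, nodes * cores_per_node


# 20x24 atmosphere tasks plus the IO server tasks.
print(nodes_and_cores(20 * 24 + 8, tasks_per_node=8))   # (61, 488)
print(nodes_and_cores(20 * 24 + 10, tasks_per_node=7))  # (70, 560)
```

Placing 7 tasks on each 8-core node leaves one core per node free, which is why the second configuration is charged for 560 cores while running 490 MPI tasks.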




Conclusions

- Usage of MPI_THREAD_FUNNELED does not provide a visible performance improvement in comparison with the results achieved with a single thread only

- Not overcommitting the nodes gives slightly better performance results, which are similar to the results obtained with a single thread only

- The number of wasted cores in the second configuration, where only 7 MPI processes are used per node, can be reduced to 0 with the functionality provided by Joerg’s mprun.py script in its explicit form, which takes 10-15 lines of text for a single run command




Next steps for future work

Merge Joerg’s byte swapping procedure from UM7.5 into UM7.8 (Joerg agreed to do this task)

Ask Solar Help whether a thread-multiple version of the MPI library could be provided to our site, so that the asynchronous functionality of UM7.8 can be used

Validation of the numerical results produced with UM7.8

Provide the information just presented to Paul Selwood and ask him:

- What are the main IOS parameter settings used at the UKMO site with UM7.8?

- What kind of performance improvement is produced in comparison with UM7.5 for the I/O part?

- What kind of parameter settings can be recommended for our case?
