The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology
Optimisation progress for UM7.8
Ilia Bermous
7 April 2011
Thanks to Joerg Henrichs, Martin Dix and Mike Naughton for their help and advice during this work

Description of global forecast test job
Global model with N320L70 resolution
Based on Fabrizio’s xazje job (which came from forecast step of Chris’s APS1 ACCESS-G development suite)
24 hour integration with ~30GB output
~3100-3250 GCR iterations per run (3162 with 7.5 and 3223 with 7.8) for 120 time steps.
Timing results are given as pairs of Elapsed CPU Time and Elapsed Wallclock Time, taken from the internal UM model timers reported in the UM job output for each run.
All runs used Mike’s version of UM run script; this script is very simple and flexible for these kinds of tasks.
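As a quick check of the solver workload, the iteration counts quoted above can be converted into an average number of GCR iterations per time step. This is only a minimal sketch; the variable names are illustrative and are not part of the UM.

```python
# Average GCR iterations per time step, from the totals quoted for the test job.
TIME_STEPS = 120
gcr_iterations = {"UM7.5": 3162, "UM7.8": 3223}

per_step = {version: total / TIME_STEPS for version, total in gcr_iterations.items()}

for version, mean_iters in per_step.items():
    print(f"{version}: {mean_iters:.2f} GCR iterations per time step")
```

Both versions average roughly 26 to 27 iterations per step, so the solver workload of the two test runs is comparable.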

UM7.5 best performance results on Solar
- Software used
  - Intel11.0.083 compiler
  - OpenMPI mpi/sun-8.2 library
- UM model environment settings and source changes
  - Joerg’s byte swapping procedure
  - Q_POS_METHOD=5 for the improved QPOS algorithms
  - Lustre file system striping
- Best elapsed times (in sec) with a 20x24 decomposition => 480 cores (i.e. under 500 cores)
  - Full I/O: (475; 497) (484; 513) (492; 519)
  - No I/O: (330; 335) (306; 311) (310; 315)

Major UM7.8 developments for performance improvement
- Asynchronous parallel I/O (requires OpenMP)
  - This new feature is activated at both the build and run stages
  - Only works with OpenMP: the UMUI “Use OpenMP” option in the “User Information and Submit Method => Job submission method” panel must be selected
- PMSL revised algorithm (Jacobi algorithm)
  - A revised algorithm based on a Jacobi solver is introduced and the number of iterations increased. Even with more iterations this new method is cheaper and scales at higher node counts.
- Optimisation of FILL_EXTERNAL_HALOS
  - Resulted in a ~5% reduction in the runtime cost of the fill_external_halos routine
- Improved QPOS algorithms
  - All new versions are significantly quicker at scale (~10%) but require scientific validation as the science is altered somewhat. The "level" method has been validated in PS25 and is now being used in the global model at the Met Office.

Summary of attempts made for UM7.8
OpenMP usage with Intel compiler and 1 thread on Solar
OpenMP usage with Intel compiler and 2 threads on Solar
OpenMP usage with SunStudio compiler and 2 threads on Solar
OpenMP usage with Intel compiler and 2 threads on NCI system
OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar

OpenMP usage with Intel compiler and 1 thread on Solar
Problems resolved and reported
- Found a number of cases in the sources of inconsistent usage of allocate/deallocate statements and IF block logic => reported to the UM developers
- Significant impact on performance if TMPDIR is used and modified in the UM scripts => reported to the developers
- A couple of missing environment variables, such as OMP_NUM_THREADS and OMP_STACKSIZE, should be set by the UMUI scripts if multithreading is used
- A run time crash with the Intel11.0.083 compiler was resolved by using Intel11.1.073, the latest compiler available on our site

OpenMP usage with Intel compiler and 1 thread on Solar (cont #2)
Performance results in sec (20x24x1, Lustre striping, FLUME_IOS_NPROC=8)

UM7.8                     UM7.5 (without multithreading)
Full I/O    No I/O        Full I/O    No I/O
370; 389    221; 225      484; 512    330; 335
362; 389    205; 210      475; 497    306; 311
389; 416    209; 212      492; 519    310; 315
            228; 235                  277; 285
Conclusions:
1. The full I/O case runs over 20% faster with UM7.8 than with UM7.5
2. Without I/O, UM7.8 runs ~1.5 times faster than UM7.5
3. There is no visible performance improvement in the I/O part: 179 sec vs 186 sec
4. The results in the last row (shown in red on the slide) are from the runs used to compare performance section by section
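The figures in these conclusions can be reproduced from the table. The sketch below assumes that each pair is (Elapsed CPU Time; Elapsed Wallclock Time) and that the second row holds the best run in every column; the dictionary layout and names are illustrative only.

```python
# Best runs from the table, as (CPU; wallclock) pairs in seconds.
best = {
    "UM7.8": {"full_io": (362, 389), "no_io": (205, 210)},
    "UM7.5": {"full_io": (475, 497), "no_io": (306, 311)},
}

def wallclock(version, case):
    """Return the wallclock component of a (CPU; wallclock) pair."""
    return best[version][case][1]

# Speedup of UM7.8 over UM7.5 for each case (ratio of wallclock times).
full_io_speedup = wallclock("UM7.5", "full_io") / wallclock("UM7.8", "full_io")
no_io_speedup = wallclock("UM7.5", "no_io") / wallclock("UM7.8", "no_io")

# Cost of the I/O part, estimated as full I/O time minus no I/O time.
io_cost = {v: wallclock(v, "full_io") - wallclock(v, "no_io") for v in best}

print(f"Full I/O speedup: {full_io_speedup:.2f}")
print(f"No I/O speedup:   {no_io_speedup:.2f}")
print(f"I/O cost (sec):   {io_cost}")
```

Under these assumptions the I/O cost comes out at 179 sec for UM7.8 and 186 sec for UM7.5, which matches conclusion 3.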

OpenMP usage with Intel compiler and 1 thread on Solar (cont #3)
Performance comparison between the top 6 sections for UM7.8 and UM7.5 without I/O (times in sec)

UM7.8                      UM7.5
PE_Helmholtz      68.15    PE_Helmholtz      70.76
SL_Full_wind      31.35    ATM_STEP          52.30
ATM_STEP          26.46    SL_Full_wind      31.19
SL_Thermo         21.52    SL_Thermo         27.20
READDUMP          13.53    READDUMP          12.67
Atmos_Physics2     9.55    NI_filter_Ctl     15.63
Conclusions:
1. Comparing the top sections, the major performance improvements come from ATM_STEP (25.74 sec), NI_filter_Ctl (9.59 sec) and SL_Thermo (5.68 sec), which give 41.01 sec in total
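The per-section savings can be recomputed for the sections that appear in both top-6 lists. This is a minimal sketch with illustrative names. NI_filter_Ctl and Atmos_Physics2 each appear in only one list, so their differences cannot be derived from the table alone. The table values as transcribed give 25.84 sec for ATM_STEP, slightly different from the 25.74 sec quoted in the conclusion.

```python
# Section timings in seconds, without I/O, as listed in the top-6 table.
um78 = {"PE_Helmholtz": 68.15, "SL_Full_wind": 31.35, "ATM_STEP": 26.46,
        "SL_Thermo": 21.52, "READDUMP": 13.53, "Atmos_Physics2": 9.55}
um75 = {"PE_Helmholtz": 70.76, "ATM_STEP": 52.30, "SL_Full_wind": 31.19,
        "SL_Thermo": 27.20, "READDUMP": 12.67, "NI_filter_Ctl": 15.63}

# Only sections present in both lists can be compared directly.
common = sorted(set(um78) & set(um75))

# Positive values mean UM7.8 is faster than UM7.5 for that section.
savings = {name: round(um75[name] - um78[name], 2) for name in common}

for name, saved in sorted(savings.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {saved:7.2f} sec")
```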

UM7.8 performance comparison: full I/O, 20x24 decomposition and Lustre striping
- EXE1 – Intel11.1.073, -openmp, 1 thread, FLUME_IOS_NPROC=8, buffer_size=6000 (results from the previous slide)
- EXE2 – UMUI standard building procedure using Intel11.0.083 with the “safe” level of optimisation (UMUI build job xbauk)
- EXE3 – Intel11.1.073, bld.cfg is based on Imtiaz’s version without “-WB -warn all -warn nointerfaces -align all” (due to Intel compiler problems)

EXE1        EXE2        EXE3
370; 389    381; 411    370; 396
362; 389    362; 395    369; 397
389; 416    366; 392    382; 406

Conclusion:
1. Usage of OpenMP with a single thread and parallel I/O with FLUME_IOS_NPROC=8 does NOT provide any performance advantage

OpenMP usage with Intel compiler and 2 threads on Solar
- The same slow performance issue that was found and investigated in detail for UM7.5, and reported in August 2010, still exists; Intel has not addressed it at all
  - Monitoring execution of the model: sometimes the job runs fast for the first 10-15 steps and then slows down significantly; sometimes the slowdown happens from the start of a run
  - Elapsed times for a 14x18 decomposition with 2 threads per MPI process (504 cores), without I/O, are (4915 sec; 4921 sec), compared with (360 sec; 366 sec) using 14x18x1 without I/O
- Conclusions: this long-standing problem must be addressed by Intel
  - At the moment it is not the most critical issue in getting asynchronous parallel I/O functionality with UM7.8
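To put the size of this problem in numbers, the slowdown factor of the 2-thread run relative to the single-thread run can be computed from the elapsed times quoted above. A minimal sketch, assuming the pairs are (CPU; wallclock) in seconds; the names are illustrative.

```python
# Elapsed (CPU; wallclock) times in seconds for the 14x18 decomposition, no I/O.
two_threads = (4915, 4921)   # 14x18x2, 504 cores
one_thread = (360, 366)      # 14x18x1, 252 cores

# Slowdown of the 2-thread run relative to the 1-thread run, per timer.
cpu_slowdown = two_threads[0] / one_thread[0]
wallclock_slowdown = two_threads[1] / one_thread[1]

print(f"CPU time slowdown:       {cpu_slowdown:.1f}x")
print(f"Wallclock time slowdown: {wallclock_slowdown:.1f}x")
```

Doubling the core count therefore makes the run more than 13 times slower, rather than faster.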

OpenMP usage with SunStudio compiler and 2 threads on Solar
Problems resolved and reported
- Usage of POINTER INTENT attributes, which is not supported by the Fortran standard => used a work around, reported the problem to the UM development team
Multithreading performance results
- UM performance using 2 threads per MPI process is better than without OpenMP, but scaling is very poor (Lustre striping was not used):
    14x18x2 threads + FLUME_IOS_NPROC=4 => 508 cores => 753 sec
    14x18x1 thread + FLUME_IOS_NPROC=0 => 252 cores => 794 sec
Note:
- Date command output was used to calculate elapsed times
- Several runs using different run configurations, such as 20x24x1 and 16x32x1, crashed; the nature of these problems has not been investigated
- Usage of different optimisation options, such as -O3, -O5, -xtarget=native, -xarch=native, -dalign, -g, does not make any visible impact on the performance results
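The poor scaling can be quantified as a relative parallel efficiency, taking the single-thread run as the baseline. The sketch below uses only the two elapsed times quoted above; the function name and signature are illustrative.

```python
def relative_efficiency(base_cores, base_time, cores, time):
    """Return (speedup, efficiency) of a run relative to a baseline run.

    Efficiency is the achieved speedup divided by the increase in core count,
    so 1.0 would mean perfect scaling.
    """
    speedup = base_time / time
    return speedup, speedup / (cores / base_cores)

# Baseline: 14x18x1 on 252 cores (794 sec); scaled run: 14x18x2 + 4 IO servers on 508 cores (753 sec).
speedup, efficiency = relative_efficiency(252, 794, 508, 753)

print(f"Speedup:    {speedup:.3f}")
print(f"Efficiency: {efficiency:.1%}")
```

A speedup of about 1.05 on roughly twice the cores corresponds to an efficiency of only about 52%.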

OpenMP usage with Intel compiler and 2 threads on NCI system
- The same slow performance issue as on Solar exists on the NCI system using the Intel11.1.073 compiler and the openmpi1.4.3 library: (3565 sec; 3703 sec)
  - Monitoring execution of the model: the job started to run slowly from the first time step
- Usage of the latest Intel12.0.084 compiler
  - Compilation crashes for one file; the work around recommended by Martin, using “-O0” instead of “-O2”, fixes the problem
  - Execution with a 14x18 decomposition using a single thread crashes; this run time problem has not been investigated

OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar
Due to the slow performance issue with multithreading for the computational part, the main idea in this approach is
- to compile all UM7.8 sources excluding the “io_services” library without the “-openmp” compilation option
- to compile the “io_services” library with multithreading using the “-openmp” compilation option
UM7.8 major terms in relation to asynchronous parallel I/O:
- FLUME_IOS_NPROC – number of MPI tasks allocated to act as IO servers
- IOS_Spacing – the gap between IO servers in MPI_COMM_WORLD (for optimal performance a node has no more than one IO server)
- buffer_size – amount of data (MB) that each IO server can have outstanding
- IOS_use_async_stash – use asynchronous communications to accelerate diagnostics output
- IOS_use_async_dump – asynchronous DUMP output not currently available

OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #2)
Found problems/issues
- Using the 14x18x2 configuration with FLUME_IOS_NPROC=8 and a spacing of 8 (8 nodes are overcommitted), a run time error was produced:
    forrtl: severe (40): recursive I/O operation, unit 6, file unknown
  - work around: several write statements producing similar diagnostic output have been commented out (as per Joerg’s message, Peter Kerney has reported this problem to Intel Support)
- The main asynchronous parallel I/O functionality in the latest model development is available only if the MPI library allows multiple threads to call MPI with no restrictions (MPI_THREAD_MULTIPLE). Unfortunately, our MPI library (OpenMPI) provides only single-threaded support (MPI_THREAD_SINGLE); this is checked by an MPI_QUERY_THREAD call, which returns the current level of thread support
- Comment: this is another example of the obstacles that arise when the user has a different platform from the one used by the developer

OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #3)
With the current version of the MPI library, to be able to use some parts of the implemented UM7.8 functionality, Joerg suggested overriding the UM7.8 setting of MPI_THREAD_SINGLE with MPI_THREAD_FUNNELED (the task can be multi-threaded, but only the main thread makes MPI calls; all MPI calls are funneled to the main thread)
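The MPI standard defines four thread support levels and guarantees that they are ordered from least to most permissive, so a code can test whether the level the library provides is at least the level it requires. The sketch below only models that ordering in plain Python; it does not call MPI. The integer values and the `is_sufficient` helper are illustrative assumptions, and a real program would obtain the provided level from MPI_Init_thread or MPI_Query_thread.

```python
from enum import IntEnum

class ThreadLevel(IntEnum):
    """MPI thread support levels, ordered as in the MPI standard.

    The integer values here are illustrative; real MPI libraries define their
    own constants, but the relative ordering is guaranteed.
    """
    SINGLE = 0      # only one thread will execute
    FUNNELED = 1    # multi-threaded, but only the main thread makes MPI calls
    SERIALIZED = 2  # any thread may call MPI, but never two at the same time
    MULTIPLE = 3    # any thread may call MPI with no restrictions

def is_sufficient(provided, required):
    """Return True if the provided thread level covers the required one."""
    return provided >= required

# The situation described in the slides: full asynchronous I/O needs MULTIPLE,
# while the library reports SINGLE.
print(is_sufficient(ThreadLevel.SINGLE, ThreadLevel.MULTIPLE))    # False
print(is_sufficient(ThreadLevel.FUNNELED, ThreadLevel.FUNNELED))  # True
```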
Results (in sec) using a 20x24 decomposition with Lustre file system striping (4 MB, 8 ways), buffer_size=6000

488 cores (FLUME_IOS_NPROC=8),    560 cores (FLUME_IOS_NPROC=10),
8 MPI processes per node          7 MPI processes per node
431; 436                          390; 395
425; 429                          404; 413
423; 428                          406; 414

OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #4)
Conclusions
- Usage of MPI_THREAD_FUNNELED does not provide a visible performance improvement in comparison with the results achieved using a single thread only
- When the nodes are not overcommitted (the second configuration), performance is slightly better and is similar to the results obtained with a single thread only
- The number of wasted cores in the second configuration, where only 7 MPI processes are used per node, can be reduced to 0 by using Joerg’s mprun.py script in its explicit form, which takes 10-15 lines of text for a single run command
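The core counts of the two configurations can be reproduced with some simple accounting. This is a sketch under stated assumptions: 8 cores per node (inferred from the quoted totals of 488 and 560 cores), IO servers counted as extra MPI processes on top of the 20x24 decomposition, and 2 threads on each IO server process only. The function and its parameter names are hypothetical, and the idle-core figure is only one possible reading of "wasted cores".

```python
import math

CORES_PER_NODE = 8  # assumption inferred from the core counts quoted in the slides

def core_accounting(nx, ny, ios_nproc, procs_per_node, ios_threads=2):
    """Count MPI processes, nodes, allocated cores and idle cores for a run.

    Assumes IO servers are extra MPI processes and that only they run
    `ios_threads` threads; a negative idle count means overcommitted cores.
    """
    mpi_procs = nx * ny + ios_nproc
    nodes = math.ceil(mpi_procs / procs_per_node)
    cores = nodes * CORES_PER_NODE
    threads = nx * ny + ios_nproc * ios_threads
    return {"mpi_procs": mpi_procs, "nodes": nodes,
            "cores": cores, "idle_cores": cores - threads}

first = core_accounting(20, 24, ios_nproc=8, procs_per_node=8)
second = core_accounting(20, 24, ios_nproc=10, procs_per_node=7)

print(first)
print(second)
```

Under these assumptions the first configuration allocates 488 cores and the second 560 cores, matching the column headings of the results table.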

Next steps for future work
- Merge Joerg’s byte swapping procedure from UM7.5 into UM7.8 (Joerg agreed to do this task)
- Ask Solar Help whether a thread-multiple version of the MPI library could be provided to our site, so that the asynchronous functionality of UM7.8 can be used
- Validation of the numerical results produced with UM7.8
- Provide the information presented here to Paul Selwood and ask him:
  - What are the main IOS parameter settings used at the UKMO site with UM7.8?
  - What kind of performance improvement is produced in comparison with UM7.5 for the I/O part?
  - What kind of parameter settings can be recommended for our case?