
www.idris.fr

Institut du Développement et des Ressources en Informatique Scientifique

Hybrid programming with MPI and OpenMP
On the way to exascale…


P.-Fr. Lavallée – WSTOOLS 2012 – October 1st

Trends of hardware evolution

• Main problem: how to deal with power consumption?

• Simplification of the cores and multiplication of their number

− Many-core processors (like Intel Xeon Phi, IBM BG/Q)

− Accelerators (like NVIDIA Tesla or AMD Fusion)

− ARM-based microprocessors (see http://www.montblanc-project.eu/ for more information)

• Common characteristics that impact users and applications

− Huge number of threads of execution; remember that exascale = 1 billion threads of execution!

− Intensive use of SMT or Hyper-Threading to get good performance (at least 2 to 4 threads per core!)

− Vectorization (SIMD) is required to use the hardware efficiently; the compiler tries its best, but that is not enough (yet)…

− Memory per execution thread shrinks…

Introduction to hybrid MPI+OpenMP parallelization

For homogeneous architectures without accelerators, two well-recognized and mature standards are available to parallelize applications:

• OpenMP: for shared-memory architectures

− Directive-based API supporting C/C++ and Fortran, used to create threads (via parallel regions), to choose the data-sharing attribute of variables (PRIVATE or SHARED), to share work among the threads (DO, SECTIONS and TASK) and to synchronize threads (BARRIER, ATOMIC, CRITICAL, FLUSH); a minimal sketch follows this list

− Latest official OpenMP specification: version 3.1 (July 2011)

− Waiting for version 4.0 (error model, NUMA support, accelerator and tasking extensions)

• MPI: for all kinds of architectures

− Message-passing library supporting C/C++ and Fortran, used to manage one-sided, point-to-point or collective communications between processes, to define topologies and derived datatypes, to handle parallel I/O and to synchronize processes

− Latest official MPI specification: MPI 3.0, released September 21, 2012
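As an illustration of the OpenMP side, here is a minimal, hedged Fortran sketch (the routine name scale_array and the arrays a and b are illustrative, not taken from the talk): one parallel region with explicit data-sharing attributes and a worksharing DO loop.

  ! Minimal OpenMP sketch: parallel region, explicit data-sharing
  ! attributes, and a worksharing DO loop.
  subroutine scale_array(a, b, n)
    implicit none
    integer, intent(in)  :: n
    real,    intent(in)  :: a(n)
    real,    intent(out) :: b(n)
    integer :: i

    !$OMP PARALLEL DEFAULT(NONE) SHARED(a, b, n) PRIVATE(i)
    !$OMP DO
    do i = 1, n
       b(i) = 2.0 * a(i)    ! each thread handles a chunk of the iterations
    end do
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine scale_array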

Introduction to hybrid MPI+OpenMP parallelization

• The majority of codes are parallelized with either MPI or OpenMP

• Nevertheless, for some applications, this approach begins to show its limitations on the latest generation of massively parallel architectures, for various reasons:

− Granularity of the code (it is decreasing)

− Memory consumption of the application (it no longer fits what is available…)

− Algorithmic and hardware limitations (visible only beyond a certain threshold…)

− Huge load imbalance (very hard to deal with)

− Overheads (they increase with the number of cores)

− All of this leads to disappointing performance and very limited scalability!

• Solving all these issues is far from being simple…

Introduction to hybrid MPI+OpenMP parallelization

• The main problem is simple : too many MPI processes to manage, with too little work to execute…

• How to reduce the number of MPI processes ?

− Replace MPI processes with OpenMP threads !

− That’s what is called hybrid programing with MPI and OpenMP

− OpenMP can be replaced by any threading library

• Take the best of both approaches :

− MPI to exchange data between nodes

− OpenMP to benefit from the shared memory inside a node

• Mixing MPI and OpenMP in a two-level parallelization seems natural:

− Fits perfectly the hardware characteristics of various machines (either fat or thin nodes…)

− Has a lot of advantages but also some drawbacks, be careful…


Thread Support in MPI

For a multithreaded MPI application, replace

MPI_INIT(…)

with

MPI_INIT_THREAD(REQUIRED, PROVIDED, IERROR)

where REQUIRED is the thread-support level requested by the application and PROVIDED is the level actually granted by the MPI library (a minimal sketch is given after the list of levels below).

• MPI_THREAD_SINGLE: only one thread per MPI process; OpenMP cannot be used

• MPI_THREAD_FUNNELED: multiple threads per MPI process, but only the main thread makes MPI calls. MPI calls are made outside OpenMP parallel regions, or inside them by the main thread only (the one which made the MPI_INIT_THREAD call)

• MPI_THREAD_SERIALIZED: all threads can make MPI calls, but only one at a time. In an OpenMP parallel region, MPI calls have to be made in critical sections

• MPI_THREAD_MULTIPLE: completely multithreaded, without restrictions (except for MPI collective calls using the same communicator)
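Below is a minimal, hedged Fortran sketch of this initialization (the program and variable names are illustrative): it requests MPI_THREAD_FUNNELED and checks the level actually provided by the library.

  program hybrid_init
    use mpi
    implicit none
    integer :: provided, ierror, rank

    ! Request FUNNELED support: only the master thread will make MPI calls
    call MPI_INIT_THREAD(MPI_THREAD_FUNNELED, provided, ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

    ! The library may grant less than what was requested: check it
    if (provided < MPI_THREAD_FUNNELED .and. rank == 0) then
       print *, 'Warning: requested thread support level not provided'
    end if

    ! ... hybrid MPI+OpenMP work goes here ...

    call MPI_FINALIZE(ierror)
  end program hybrid_init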

Introduction to hybrid MPI+OpenMP parallelization

Drawbacks of the hybrid MPI+OpenMP approach:

• Complexity of the application (especially with MPI_THREAD_MULTIPLE) and the high level of expertise required from developers

• Performance improvements are not guaranteed; good MPI and OpenMP performance and efficiency are mandatory (Amdahl's law applies to both approaches)…

• Memory affinity, mapping, binding, etc. have to be carefully managed (the same problems exist with flat MPI or OpenMP codes)

• Data races, deadlocks, race conditions or wrong data-sharing attributes: all the pitfalls of MPI and OpenMP are combined, which leads to very complex debugging

• No really mature and robust tools to debug hybrid MPI+OpenMP applications at scale (or even on a small number of cores…)

So, is there still an interest in hybrid MPI+OpenMP parallelization?

Yes, of course. Fortunately, the advantages are even greater…

Memory saving

Memory per thread of execution is scarce, and the hybrid approach optimizes its usage. But where do the memory savings come from?

• Hybrid programming allows optimizing the code for the target architecture. The latter is generally composed of shared-memory (SMP) nodes linked by an interconnect network. The advantage of the shared memory inside a node is that it is not necessary to duplicate data in order to exchange them: every thread can access (read/write) SHARED data.

• The ghost (or halo) cells, introduced to simplify the programming of MPI codes using domain decomposition, are no longer required within the SMP node. Only the ghost cells associated with the inter-node communications remain mandatory. This saving is far from negligible (a back-of-the-envelope estimate follows this list); it depends heavily on the order of the method, the domain type (2D or 3D), the domain decomposition (in one or multiple dimensions) and the number of cores of the SMP node.

• The memory footprint of the system buffers associated with MPI is not negligible and increases with the number of processes. For example, for an InfiniBand network with 65,000 MPI processes, the footprint of the system buffers reaches 300 MB per process, i.e. almost 20 TB in total!
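As a hedged back-of-the-envelope estimate of the halo saving (the subdomain size n, the halo width h and the number of cores c per node are illustrative assumptions, not figures from the talk): for a 2D decomposition into n x n subdomains with a halo of width h on each side, every subdomain stores about

  4nh + 4h^2

extra halo cells. A pure MPI run keeps c such halos on a c-core node, whereas a hybrid run with one MPI process per node keeps only the halo of the single, larger node-level subdomain; the halos of the internal interfaces disappear, which is where most of the saving comes from.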

Memory saving

OK in theory, but in real life, do we observe any gain in the memory consumption of applications?

Memory saving

• Source: "Mixed Mode Programming on HECToR", Anastasios Stathopoulos, August 22, 2010, MSc in High Performance Computing, EPCC

• Target machine: HECToR CRAY XT6

• Results (the memory per node is expressed in MiB):

  Code   | Pure MPI: processes | Pure MPI: Mem./node | Hybrid: MPI x threads | Hybrid: Mem./node | Memory saving
  CPMD   | 1152                | 2400                | 48 x 24               | 500               | 4.8
  BQCD   | 3072                | 3500                | 128 x 24              | 1500              | 2.3
  SP-MZ  | 4608                | 2800                | 192 x 24              | 1200              | 2.3
  IRS    | 2592                | 2600                | 108 x 24              | 900               | 2.9
  Jacobi | 2304                | 3850                | 96 x 24               | 2100              | 1.8

Memory saving

• Source: "Performance evaluations of gyrokinetic Eulerian code GT5D on massively parallel multi-core platforms", Yasuhiro Idomura and Sébastien Jolliet, SC11

• Executions on 4096 cores on :

− Fujitsu BX900 with Nehalem-EP processors at 2.93 GHz (8 cores and 24 GiB per node)

− Fujitsu FX1 with SPARC64 VII processors at 2.5 GHz (4 cores and 32 GiB per node)

• All sizes given in TiB

  System | Pure MPI: total (code+sys) | 4 threads/process: total (code+sys) | Gain | 8 threads/process: total (code+sys) | Gain
  BX900  | 5.4 (3.4+2.0)              | 2.69 (2.25+0.44)                    | 2.0  | -                                   | -
  FX1    | 5.4 (3.4+2.0)              | 2.83 (2.39+0.44)                    | 1.9  | 2.32 (2.16+0.16)                    | 2.3

Conclusions on memory saving

• Too often, this aspect is forgotten when talking about hybrid programming.

• However, the potential gains are very significant and could be exploited to increase the size of the problems to be simulated!

• The gap, in terms of memory usage, between the MPI and hybrid approaches will continue to grow rapidly on the next generations of machines:

− Increase in the total number of cores

− Rapid increase in the number of cores within an SMP node

− General use of Hyper-Threading or SMT (the possibility to run multiple threads simultaneously on one core)

− General use of high-order numerical methods (nearly free computational cost thanks to hardware accelerators)

• This will make the transition to hybrid programming almost mandatory...

Overcoming algorithmic limitations

• Some applications are limited in terms of scalability by a physical parameter (for example, the dimension in one direction).

• In the NAS Parallel Benchmark, the problem size defines the notion of zone. The maximum number of MPI processes cannot exceed the number of zones (limited to 1024 for the class D and 2048 for the class E problem sizes).

• The hybrid version of the code is still limited in terms of MPI processes, but each MPI process can manage multiple OpenMP threads… The total number of threads of execution is the number of MPI processes times the number of OpenMP threads per MPI process. For example, with a class D problem limited to 1024 MPI processes, running 16 OpenMP threads per process allows up to 16384 threads of execution.

• On BG/P, you can gain up to a factor of 4, and up to a factor of 16 on a BG/Q, with excellent scalability!

Performance and scalability

Many factors contribute to increasing the performance and scalability of applications using a hybrid MPI+OpenMP parallelization:

• Better MPI granularity: hybridization uses the same number of execution cores, but with a reduced number of MPI processes. Each MPI process therefore has much more work to manage, which improves the granularity of the application…

• Better load balancing: for a pure MPI application, dynamic load balancing is very complex to implement and time consuming (it requires heavy use of message passing). For a hybrid application, dynamic load balancing inside each MPI process is easy to manage (with the DYNAMIC or GUIDED schedule for parallel loops, or directly by hand using the shared memory; see the sketch after this list). Load balancing is a critical factor for massive parallelism, impacting the scalability of the code.

• Optimization of communications: reducing the number of MPI processes minimizes the number of communications and increases the size of messages. Hence the impact of latency is reduced and the throughput of communications is improved (even more important for applications that make heavy use of collective communications).
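Here is a minimal, hedged sketch of the dynamic load balancing mentioned above (the routine name process_cells and the per-cell workload are illustrative, not taken from a real code): inside one MPI process, chunks of iterations are handed out to idle threads at run time.

  ! Dynamic scheduling inside one MPI process: chunks of 4 iterations are
  ! distributed to idle threads at run time, absorbing the cost imbalance
  ! between cells without any message passing.
  subroutine process_cells(n, work, result)
    implicit none
    integer, intent(in)  :: n
    integer, intent(in)  :: work(n)      ! per-cell cost, varies from cell to cell
    real,    intent(out) :: result(n)
    integer :: i, k

    !$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4) DEFAULT(SHARED) PRIVATE(i, k)
    do i = 1, n
       result(i) = 0.0
       do k = 1, work(i)                 ! irregular amount of work per cell
          result(i) = result(i) + sin(real(k))
       end do
    end do
    !$OMP END PARALLEL DO
  end subroutine process_cells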

Performance and scalability

• Improvement of the convergence of certain iterative algorithms: if the iterative algorithm uses information relative to the local domain associated with each MPI process, then reducing the number of MPI processes results in bigger local domains containing much more information. The convergence rate of the iterative algorithm therefore improves, leading to a better time to solution…

• Optimization of I/O: reducing the number of MPI processes leads to fewer simultaneous disk accesses and increases the size of records. As a consequence, metadata servers are less loaded and the record size is better adapted to the disk system.

• An approach that fits new architectures (many-cores, …) perfectly: with hybrid parallelization, you can naturally create and manage many threads, which can be used to oversubscribe the cores (SMT or Hyper-Threading) and use the hardware efficiently.

Performance and scalability

• The potential gains in terms of performance become even more important as the number of execution cores grows.

• If the hybrid parallelization is well done, the scalability limit of the hybrid version of the code can be improved, compared to the flat MPI version, by a factor of up to the number of cores of the SMP node!

• Let's have a look at a real-life application named HYDRO.

Application HYDRO

• HYDRO is a 2D Computational Fluid Dynamics code (~1500 lines of Fortran 90) that solves Euler's equations with a Finite Volume Method, using Godunov's scheme and a Riemann solver at each interface on a regular mesh.

• Selected as the PRACE application benchmark for the assessment of WP9 prototypes

• Thanks to many contributors, various versions of HYDRO have been developed:

− Sequential versions: F90, C99

− Accelerated versions: HMPP, CUDA, OpenCL

− Parallel versions: OpenMP (fine and coarse grain), MPI, hybrid MPI+OpenMP

− Other versions: Cilk cache-oblivious version, X10, …

HYDRO results

• Characteristics of the hybrid version of HYDRO:

− MPI_THREAD_FUNNELED level of thread support (MPI calls are made inside the parallel region, but only by the master thread; a minimal sketch of this pattern is given after the list)

− The MPI parallelization relies on a 2D domain decomposition, with MPI derived datatypes and synchronous communications with the neighbours

− The OpenMP parallelization relies on another 2D domain decomposition (coarse-grain approach), with fine synchronization among threads managed by the FLUSH directive to cope with dependencies

• We will compare the pure MPI version and the hybrid MPI+OpenMP version of HYDRO

• All timings are in seconds (s) and correspond to the elapsed time of the full application
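As a hedged illustration of the MPI_THREAD_FUNNELED pattern described above (this is a sketch, not the actual HYDRO code; the routine name exchange_halo and the variables send_buf, recv_buf and neighbour are illustrative), here is a halo exchange called from inside an OpenMP parallel region, with the MPI call restricted to the master thread:

  ! Halo exchange called from inside an OpenMP parallel region:
  ! only the master thread talks to MPI (MPI_THREAD_FUNNELED).
  subroutine exchange_halo(send_buf, recv_buf, n, neighbour)
    use mpi
    implicit none
    integer, intent(in)    :: n, neighbour
    real,    intent(in)    :: send_buf(n)
    real,    intent(inout) :: recv_buf(n)
    integer :: ierror

    !$OMP BARRIER     ! send buffers must be fully written before the exchange
    !$OMP MASTER
    call MPI_SENDRECV(send_buf, n, MPI_REAL, neighbour, 0, &
                      recv_buf, n, MPI_REAL, neighbour, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierror)
    !$OMP END MASTER
    !$OMP BARRIER     ! the halo is now up to date for every thread
  end subroutine exchange_halo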

HYDRO results

• Goal: is the hybrid approach interesting on a moderate number of execution cores?

• Target architecture: 2 IBM SP6 nodes (64 cores)

• The total number of threads of execution is fixed at 64. The number of OpenMP threads per MPI process varies from 1 (pure MPI version) to 32.

  MPI x OpenMP per node | Time (s) on 64 execution cores
  32 x 1                | 66.8
  16 x 2                | 60.3
  8 x 4                 | 56.4
  4 x 8                 | 58.2
  2 x 16                | 58.7
  1 x 32                | 63.1

HYDRO results

• Goal: determine whether the hybrid approach is more scalable than pure MPI

• Target architecture: IBM BG/P (10 racks)

• Strong scaling on a large number of execution cores (from 4096 to 40960 cores)

• All timings are in seconds (s)

  Cores       | Pure MPI | Hybrid with 4 threads per MPI process
  4096 cores  | 61.7     | 62.4
  8192 cores  | 35.4     | 31.0
  16384 cores | 31.0     | 16.3
  32768 cores | 80.7     | 12.0
  40960 cores | 136.7    | 12.6

HYDRO results

• Scalability limit of the pure MPI version: 8192 cores

• Scalability limit of the hybrid version: optimal at 16384 cores, sub-optimal at 32768 cores

• On 32768 cores, the hybrid version is more than 6 times faster than the pure MPI version…

• The best hybrid run (on 32768 cores) is 2.6 times faster than the best pure MPI run (on 8192 cores)

Conclusions

• No need for hybrid parallelization if you do not face any scalability and/or memory-consumption problem with your MPI application…

• A sustainable approach, based on recognized, mature and widely available standards (MPI and OpenMP); it is a long-term investment.

• The advantages of the hybrid approach compared to the pure MPI approach are many:

− Significant memory saving

− Gains in performance (on a fixed number of execution cores), through a better adaptation of the code to the target architecture

− Gains in terms of scalability, pushing the scalability limit of a code by a factor of up to the number of cores of the shared-memory node

• These different gains are proportional to the number of cores of the shared-memory node, a number that will increase significantly in the short term (general use of multi/many-core processors)

• A durable solution that allows an efficient usage of the next massively parallel architectures (multi-peta, exascale, ...) but still has to evolve to take accelerators into account (OpenCL, OpenACC, OpenMP 4.0, …)