
PERFORMANCE ANALYSIS FRAMEWORK FOR MPI APPLICATION SUPPORT ON THE REMOTE INSTRUMENTATION GRID

A. Cheptsov, B. Koller

High-Performance Computing Center (HLRS), University of Stuttgart, Germany

Abstract. In recent years the Grid has become one of the most significant IT trends, opening high-performance computing to a number of scientific communities. Large-scale infrastructures, such as the Distributed European Infrastructure for Supercomputing Applications set up within the DEISA project or the Remote Instrumentation Infrastructure deployed within the DORII EU project, have put grid technology into practice for many application areas of e-Science and have served as a testbed for challenging experiments, often involving data acquired from complex technical and laboratory equipment. However, as grid technology has matured, attention has largely shifted towards optimizing how applications use Grid resources. The performance analysis module set up within the DORII project offers scientific applications an advanced tool set for optimizing their performance characteristics on the Grid. This paper presents the performance analysis tools adapted and the techniques elaborated for MPI applications in DORII, which may also be of interest for the optimization of a wide variety of parallel scientific applications.

Keywords: performance analysis, communication pattern, MPI


1. MOTIVATION AND INTRODUCTION

The widespread use of Internet and Web technology has resulted in leading-edge innovations that allow researchers to exploit advanced computational technology in a wide range of application areas of modern science and technology. Grid technology is a fundamental aspect of e-Science that has enabled many scientific communities to access high-performance e-Infrastructures in which virtual communities share, federate and exploit the collective power of scientific facilities.

Computation and data grids are highly beneficial for achieving high application scalability, as they offer virtually unlimited computation resources and storage facilities. Moreover, the e-Infrastructures open the Grid to a great variety of new scientific domains, which in turn pose new and challenging requirements on the e-Infrastructure. For example, the e-Infrastructure set up by the DORII (Deployment of Remote Instrumentation Infrastructure) project [1] offers a promising way for modern Grid technology to enhance the usability of scientific applications, allowing them shared access to unique and/or distributed scientific facilities (including data, instruments, computing and communications), regardless of their type and geographical location. The DORII infrastructure consolidates roughly 2200 CPU cores for computation and offers a total of 147 TB for storing data. However, as our experience of porting scientific applications to grid e-Infrastructures [2] has revealed, application performance suffers heavily even on a single node, because a standard Grid resource, a generic cluster of workstations, has poorer performance characteristics than a dedicated high-performance computer. This concerns especially the node interconnect and the file I/O system, whose characteristics can be crucial for the performance of many applications.

MPI is a widespread standard for the implementation of parallel applications, and MPI applications constitute an important part of the pilot applications ported to the e-Infrastructure in the frame of DORII. The message-passing mechanism allows an application to distribute its computational work over the nodes of a parallel computing system. The efficiency of this distribution directly determines the application's performance on a single node as well as its scalability when running on many nodes.

In order to make running MPI applications on the Grid even more efficient, special performance improvement techniques can be applied to them. These techniques rely mainly on an in-depth analysis of the implemented communication patterns. The most effective way to analyse the performance characteristics is instrumentation of the application's source code. With regard to MPI applications, instrumentation mainly means collecting time stamps of the main communication events that occur in the application at run time. There are many tools that facilitate application instrumentation, as well as GUIs intended for analysing the resulting application run profile. However, support for those tools on the currently available infrastructures is very poor. Moreover, there is no clear methodology for the consolidated use of different tools in the performance analysis of an application. As a consequence, performance analysis of parallel MPI applications is quite awkward for developers, and in most cases the complexity of the performance analysis techniques prevents application providers from in-depth analysis and performance optimization of their applications.
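Although the tools discussed below perform this instrumentation automatically, the underlying mechanism is worth a brief illustration. The following minimal sketch, which is ours and not taken from any particular tool, uses the standard PMPI profiling interface to wrap MPI_Send and record time stamps around each call; in essence this is what MPI tracing packages do behind the scenes.

/* Minimal sketch of MPI instrumentation via the standard PMPI profiling
 * interface: the wrapper intercepts MPI_Send, records time stamps around
 * the real call (PMPI_Send) and reports the event. Tracing tools write
 * such events into an in-memory trace buffer instead of printing them. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm); /* real send */
    double t1 = MPI_Wtime();

    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Send of %d elements to %d took %.6f s\n",
           rank, count, dest, t1 - t0);
    return rc;
}

Linking such wrappers ahead of the MPI library is how tracing packages collect their event records without requiring any changes to the application source.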

In order to make the full potential of several multipurpose performance analysis tools available to grid application providers, a performance analysis module has been set up within the middleware architecture of the DORII project. In this paper we introduce this module and its main components. We begin with the description of a practical use case from the DORII project, the OPATM-BFM application. We then describe how the tools of the performance analysis module are used to collect the communication profile of this test application. Finally, we generalize the experience obtained with the test application and present some performance improvement proposals that may also be of interest for other parallel applications implemented by means of MPI.

2. A USE CASE

OPATM-BFM is a physical-biogeochemical simulation model [3] developed at the Istituto Nazionale di Oceanografia e di Geofisica Sperimentale (OGS) and applied to short-term forecasts of key biogeochemical variables (among others, chlorophyll and salinity) for a wide range of coastal areas, in particular the Mediterranean Sea; it is currently exploited within the DORII project.

The model solves the transport-reaction equation (1):

\[
\frac{\partial c_i}{\partial t} + \vec{v}\cdot\nabla c_i
  = w_i\,\frac{\partial c_i}{\partial z}
  + \nabla_h\!\cdot\!\left(k_h\,\nabla_h c_i\right)
  + \frac{\partial}{\partial z}\!\left(k_z\,\frac{\partial c_i}{\partial z}\right)
  + R_{bio}\!\left(c_i, c_1, \ldots, c_N, T, I\right),
  \qquad i = 1, \ldots, N
  \tag{1}
\]

where v is the current velocity, w_i is the sinking velocity, k_h and k_z are the eddy diffusivity constants, and R_bio is the biogeochemical reactor term, which depends, in general, on the other concentrations and on the temperature T, the short-wave radiation I and other physical variables.

The complexity of OPATM-BFM lies in the great number of prognostic variables to be integrated, the dimension of the analysed ecosystems and the number of steps of the numerical solution, which depends on the forecasted period. In this context, OPATM-BFM poses several challenging scenarios for the efficient usage of modern HPC systems.

OPATM-BFM is parallelized by means of MPI, based on a domain decomposition over the longitudinal elements. The MPI implementation enables OPATM-BFM to utilize massively parallel computing resources. The number of domains corresponds to the number of computing nodes the application is running on. Consistency of the computation under this parallelization and domain decomposition is ensured by an inter-domain communication pattern that exchanges the data cells residing on the domain boundaries.
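The sketch below illustrates this kind of inter-domain boundary exchange for a one-dimensional (longitudinal) decomposition. It is an illustration only: the grid dimensions, the single tracer field and the one-column halo are assumptions of the sketch rather than details of the actual OPATM-BFM implementation.

/* Hedged sketch: 1-D domain decomposition over the longitudinal direction
 * with exchange of one boundary column ("halo") between neighbouring ranks.
 * The array layout and halo width are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

#define NZ   64      /* local grid: NZ levels x NLON local columns */
#define NLON 128

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local field with one extra (ghost) column on each side */
    double *c = calloc((size_t)NZ * (NLON + 2), sizeof(double));

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* a strided datatype describing one column of the local field */
    MPI_Datatype column;
    MPI_Type_vector(NZ, 1, NLON + 2, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* send own boundary columns, receive the neighbours' into the ghosts */
    MPI_Sendrecv(&c[1],        1, column, left,  0,
                 &c[NLON + 1], 1, column, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&c[NLON],     1, column, right, 1,
                 &c[0],        1, column, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    free(c);
    MPI_Finalize();
    return 0;
}

The efficiency of this exchange pattern has a direct bearing on the scalability figures discussed below.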

Obtaining maximal performance (in terms of execution time) and scalability (in terms of speed-up when running on an increasing number of computing nodes) is mandatory for OPATM-BFM's practical usability for tasks of realistic complexity (in particular long-term simulation). Moreover, the application poses a great challenge for different HPC architectures, with regard both to the optimal utilization of resources for the identified complex tasks of environmental simulation and to the development of algorithms enabling such efficient utilization [4].

The application's standard run consists of three major phases: initialization (where the model is initialized and the input files are read), the main simulation loop (an iterative solver whose number of iterations depends on the length of the forecasted period, each step corresponding to half an hour of the simulated system's behaviour) and data storage (file storage operations, performed at the end of the simulation and after every 48th step of the numerical solution).
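A schematic of this three-phase structure, including the phase-level wall-clock timing that underlies the measurements in Table 1, could look as follows; the routine names are placeholders and not the actual OPATM-BFM code.

/* Schematic of the three execution phases; the routines are empty
 * placeholders, not the real OPATM-BFM functions. Phase-level timing with
 * MPI_Wtime() is sufficient to produce figures like those in Table 1.
 * run() is assumed to be called between MPI_Init and MPI_Finalize. */
#include <stdio.h>
#include <mpi.h>

static void read_input_and_initialize(void)      { /* placeholder */ }
static void advance_one_half_hour_step(int step) { (void)step; /* placeholder */ }
static void store_results(void)                  { /* placeholder */ }

void run(int n_steps)   /* 3 steps in the test case, 816 in the real case */
{
    double t0 = MPI_Wtime();
    read_input_and_initialize();                /* phase 1: initialization, input */
    double t1 = MPI_Wtime();

    for (int step = 1; step <= n_steps; ++step) {
        advance_one_half_hour_step(step);       /* phase 2: main simulation loop  */
        if (step % 48 == 0)
            store_results();                    /* phase 3: periodic data storage */
    }
    store_results();                            /* phase 3: final data storage    */
    double t2 = MPI_Wtime();

    printf("init: %.1f s, loop and storage: %.1f s\n", t1 - t0, t2 - t1);
}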

An analysis of the application's time characteristics in the identified phases [2], with regard to computation on a single node and communication between the nodes (notably the duration of MPI functions), has revealed that scalability across different numbers of nodes (in the current experiment, execution on 32 and 64 nodes has been analysed) is quite poor for some operations (see Table 1).

However, it is important to highlight that the distribution of time among the phases of execution in the tested use case (the "# iterations, test case" column in Table 1) can differ greatly from that of a real long-term simulation use case (the "# iterations, real case" column in Table 1), due to the different impact of the iteratively repeated operations on the total time (the last row in Table 1). The application's performance speed-up when running on a larger number of nodes also changes with the use case. Nevertheless, the internal characteristics of the iterative phase are iteration-independent and therefore valid not only for the test case but for all use cases.


Table 1. Application performance on a changing number of computing nodes. Per-phase durations are given as computation / MPI calls / total, in seconds, with one process per node; the scalability coefficient is t64/t32.

Phases of execution      | # iter., real case | # iter., test case | 32 nodes [s]      | 64 nodes [s]      | Scalability, t64/t32
1. Initialization, Input | 1                  | 1                  | 939 / 6 / 945     | 2238 / 6 / 2244   | 2.4 / 1 / 2.4
2. Main simulation loop  | 816                | 3                  | 3 / 5 / 8         | 2 / 5 / 7         | 0.7 / 1 / 0.9
3. Data storage          | 17                 | 1                  | 7 / 204 / 213     | 3 / 170 / 173     | 0.4 / 1.25 / 0.8
Total time               |                    |                    | 949 / 226 / 1175  | 2243 / 181 / 2424 | 2.4 / 0.8 / 2

Figure 1. Using performance analysis tools for OPATM-BFM


There are several software tools that facilitate instrumentation of the source code, collection of details about the events that occurred, and the further analysis of those events. Some of the tools (e.g. the Valgrind tool suite) are useful for the analysis of a single application process on a single computing node, while others target parallel applications and focus on the analysis of the MPI communication between the nodes of the parallel computer (e.g. Vampir, Paraver). The DORII project's applications, including the presented OPATM-BFM, can greatly benefit from the use of both categories of tools. Whereas profiling a single MPI process allows the user to identify the source code regions where the most time-consuming communication takes place, detailed information about the interactions (messages, in the case of MPI applications) between the different processes of a parallel program is provided by the communication analysis tools (Figure 1).

3. PERFORMANCE ANALYSIS TOOLS AND TECHNIQUES

Practical attempts to design a scalable and easy-to-use performance measurement and monitoring software environment for supercomputing applications running on the Remote Instrumentation Infrastructure resulted in a dedicated module provided within the middleware architecture of the DORII project. In the following we briefly describe the main tools comprised by the module, Valgrind and Vampir/VampirTrace, and highlight their main usage scenarios for the example use case from the previous section, the OPATM-BFM application.

Valgrind [5] is an instrumentation framework for building dynamic analysis tools for debugging and profiling sequential applications (or single processes of a parallel application), with the ultimate aim of improving application performance. Valgrind's distribution is open source and currently includes several production-quality tools, among them memory-error and thread-error detectors, a cache and branch-prediction profiler, and a call-graph-generating cache profiler.

In order to proceed with the analysis of a parallel application efficiently, the phases of application execution should first be identified. Localizing the most computation- and communication-intensive phases without the help of any profiling tools is a non-trivial task that requires a deep understanding of the application source code as well as of the implemented algorithms. However, Valgrind is capable of building a so-called application call graph, which is sufficient for a basic understanding of the dependencies between the application's code regions as well as of the time characteristics of the communication events in those regions. This is done by means of the Callgrind tool, which is included in the current Valgrind distribution. Moreover, Callgrind also provides details on the I1, D1 and L2 CPU caches. The data produced by Callgrind can be visualized with the powerful KCachegrind front-end. For example, Figure 2 shows a fragment of the OPATM-BFM call graph visualized with KCachegrind, on the basis of which the main phases of the application run (Table 1) have been identified.
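As an illustration, the sketch below shows how Callgrind data collection can be confined to a single phase of interest using the client-request macros from Valgrind's public header; the command line in the comment and the placeholder routine are assumptions of the sketch, not part of the OPATM-BFM workflow.

/* Sketch: confining Callgrind profiling to one phase of the application,
 * using the client-request macros from Valgrind's public header. Assumed
 * invocation:  valgrind --tool=callgrind --instr-atstart=no ./app
 * The resulting callgrind.out.<pid> file can then be opened in KCachegrind.
 * Outside of Valgrind these macros expand to no-ops. */
#include <valgrind/callgrind.h>

static void phase_of_interest(void) { /* placeholder, e.g. one solver step */ }

void profile_one_phase(void)
{
    CALLGRIND_START_INSTRUMENTATION;  /* start building the call graph here */
    CALLGRIND_TOGGLE_COLLECT;         /* switch cost collection on          */

    phase_of_interest();

    CALLGRIND_TOGGLE_COLLECT;         /* switch cost collection off again   */
    CALLGRIND_STOP_INSTRUMENTATION;
    CALLGRIND_DUMP_STATS;             /* write the profile for this region  */
}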

Analysing the run profile of a single process is also an excellent starting point for profiling the communication between the MPI processes of a parallel application. Several tools are designed for the analysis of large-scale applications [6]. In the frame of the Remote Instrumentation Infrastructure we chose the VampirTrace tool [7] because of its tight integration with the Open MPI library, which is used for the implementation of parallel applications by the DORII consortium. The introduced performance analysis module does not, however, preclude the use of other well-known tools, such as Paraver and Scalasca for post-mortem analysis or Periscope for analysis at run time. VampirTrace is an application tracing package that collects a very fine-grained event trace of a sequential or parallel program. The traces can be visualized with Vampir or any other tool that reads the Open Trace Format (Figure 3). Another advantage of VampirTrace is that no changes to the application's code are required to switch on profile collection; only recompilation with the corresponding VampirTrace libraries (or with the wrapper provided with Open MPI) is needed.
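In addition to the fully automatic MPI instrumentation, VampirTrace also offers a user API for marking coarse regions, for instance the phases from Table 1, by name. The sketch below is a hedged illustration of this usage; it assumes compilation with a VampirTrace compiler wrapper and the -DVTRACE flag, and the region and routine names are illustrative.

/* Sketch: optional manual region markers with the VampirTrace user API.
 * Assumes the file is compiled with a VampirTrace wrapper (e.g. vtcc or
 * Open MPI's mpicc-vt) and -DVTRACE; without that define the macros are
 * compiled away, so the markers are harmless in normal builds. */
#include "vt_user.h"

static void simulation_step(int step) { (void)step; /* placeholder */ }

void traced_main_loop(int n_steps)
{
    VT_USER_START("main_simulation_loop");  /* named region shown in Vampir */
    for (int step = 0; step < n_steps; ++step)
        simulation_step(step);
    VT_USER_END("main_simulation_loop");
}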

Figure 2. A fragment of the OPATM-BFM call graph (visualized with KCachegrind)

Figure 3. Example of the OPATM-BFM’s communication profile visualization with Vampir


To sum up, the analysis of an MPI application's communication patterns, as well as of the time distribution among the main phases of a single process, is largely based on the profile collected at run time. The profile is stored in trace files of a special format and can be visualized by means of dedicated graphical front-ends after execution has completed. When using performance analysis tools, it is important to note that the size of the obtained trace files is proportional to the number of recorded events. For large-scale parallel applications, which often perform a very large number of function calls, in particular to the MPI library, the large size of the collected trace files is a serious obstacle to their further exploration with the visualization front-ends (Figure 4).

In such cases a preliminary analysis step is required, aimed at detecting the regions of application execution whose communications have the largest impact on the execution time. In general, low-priority communication events should be discarded from profiling, which in turn decreases the size of the trace files with the collected events. This can be done either by filtering the events in defined regions or by launching the application on special use cases with a limited number of iterations.

This task is, however, not trivial, and the approaches that allow the user to filter the events in the trace files differ from application to application. Nevertheless, it can be greatly supported by the tools of the first category, in particular on the basis of call graph analysis. For example, even for a standard (short-term) use case, the OPATM-BFM application presented above performs more than 250000 calls to the MPI library. As a consequence, the size of the trace data grew to several tens of gigabytes. Thanks to the preliminary analysis of the call graph, the trace file size was reduced to only about 200 MB, which allowed us to proceed with the post-processing analysis even on a low-performance machine.
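One concrete mechanism for such filtering in VampirTrace is a filter file referenced through the VT_FILTER_SPEC environment variable. The fragment below is a hedged illustration with made-up routine names; the exact syntax should be verified against the manual of the installed VampirTrace version.

# VampirTrace filter file (assumed syntax), referenced at run time via
#   export VT_FILTER_SPEC=/path/to/filter.spec
# Functions matching a pattern are recorded at most <limit> times;
# a limit of 0 is assumed to exclude them from the trace entirely.
trc_*;flx_*   -- 0        # made-up names: exclude chatty helper routines
MPI_Iprobe    -- 1000     # keep only the first 1000 occurrences
*             -- 3000000  # generous default limit for everything else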

Therefore, the combination of the tools presented above on the joint basis provided by the performance analysis module offers the user a complete environment for efficient and effective application performance analysis.

Figure 4. Example of a cluttered communication profile


4. SOME OPTIMIZATION RESULTS

Several techniques can be applied to MPI applications with the aim of improving their performance. As an outcome of the exemplary use of the performance analysis module for OPATM-BFM, the following main improvement proposals have been specified:

• use of MPI-IO operations for parallel access to and storage of NetCDF data (the application I/O has meanwhile been ported to the PnetCDF library);

• use of collective MPI communication for the inter-domain communication;

• encapsulation of several data chunks in a single MPI communication (a minimal sketch of this aggregation is given after this list).
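The sketch below illustrates the third proposal and, in a comment, hints at the second: the boundary columns of all tracer variables are packed into one buffer, so that a single MPI message per neighbour replaces one message per variable. All names and sizes are illustrative assumptions rather than the actual OPATM-BFM data structures.

/* Hedged sketch of message aggregation (proposal 3): the boundary columns
 * of all tracer variables are packed into a single buffer and exchanged
 * with one MPI_Sendrecv per neighbour instead of one message per variable.
 * N_VARS, NZ and the array layout are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define N_VARS 50   /* number of prognostic tracers (illustrative)  */
#define NZ     64   /* length of one boundary column (illustrative) */

void exchange_boundary(double *own_boundary[N_VARS],  /* columns to send    */
                       double *ghost[N_VARS],         /* columns to receive */
                       int send_to, int recv_from, MPI_Comm comm)
{
    double *sendbuf = malloc(sizeof(double) * N_VARS * NZ);
    double *recvbuf = malloc(sizeof(double) * N_VARS * NZ);

    /* pack: one contiguous chunk per tracer variable */
    for (int v = 0; v < N_VARS; ++v)
        memcpy(&sendbuf[v * NZ], own_boundary[v], NZ * sizeof(double));

    /* one aggregated message per neighbour; a collective over the 1-D
     * decomposition (proposal 2), e.g. MPI_Alltoallv, could replace the
     * point-to-point exchange altogether */
    MPI_Sendrecv(sendbuf, N_VARS * NZ, MPI_DOUBLE, send_to,   0,
                 recvbuf, N_VARS * NZ, MPI_DOUBLE, recv_from, 0,
                 comm, MPI_STATUS_IGNORE);

    /* unpack the received columns into the ghost cells */
    for (int v = 0; v < N_VARS; ++v)
        memcpy(ghost[v], &recvbuf[v * NZ], NZ * sizeof(double));

    free(sendbuf);
    free(recvbuf);
}

Fewer, larger messages reduce the per-message latency overhead, which is particularly noticeable on the commodity interconnects of the generic grid clusters mentioned in Section 1.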

The realization of the optimization proposals for the application's communication and I/O patterns elaborated in [2] for OPATM-BFM allowed us to improve performance dramatically for the test case (3 steps of the numerical solution), as shown in Table 2.

Table 2. Comparison of application time characteristics before and after optimization (total duration in seconds)

Phases of execution      | 32 nodes, initial | 32 nodes, optimized | 64 nodes, initial | 64 nodes, optimized
1. Initialization, Input | 945               | 300                 | 2244              | 360
2. Main simulation loop  | 8                 | 7                   | 7                 | 6
3. Data storage          | 213               | 207.5               | 173               | 172
Total                    | 1175              | 514.5               | 2424              | 538

Whereas file I/O operations dominate the application execution for a small number of simulation steps (3 steps in the test case), the overall performance improvement due to the optimization of the MPI communication becomes significant only for a long-term simulation (816 steps in the real case). As Table 2 suggests, the realized optimizations taken together reduced the duration of the application execution in the real case by about 5% on 32 nodes (from the initially measured 309 min down to 293 min). On 64 nodes the improvement grew to roughly 13% (from 213 min down to 185 min), and the application scalability grew accordingly from 145% to 158%.


5. CONCLUSIONS

The main aim of this paper was to show how different performance analysis tools and strategies can be consolidated and applied to supercomputing applications. The DORII project enables grid technology for many new applications, whose performance properties play an important role in their practical usability within modern grid environments. Performance analysis is therefore essential for running the applications efficiently, in particular in production use.

The integration of all the described tools in the common performance analysis module introduced in this paper allowed us to investigate holistically the performance of one of the most challenging applications deployed on the Remote Instrumentation Infrastructure, OPATM-BFM. The performance improvement techniques applied to the application allowed us to optimize its characteristics when running on a standard EGEE grid site. The performance analysis module will further be used to tune the performance of academic and industrial simulation applications, not limited to the ones coming from the DORII consortium.

REFERENCES

[1] See the web page of the DORII project – http://www.dorii.org

[2] A. Cheptsov, K. Dichev, R. Keller, P. Lazzari and S. Salon: Porting the OPATM-BFM Application to a Grid e-Infrastructure – Optimization of Communication and I/O Patterns. Computational Methods in Science and Technology, 15(1), 9-19 (2009)

[3] A. Crise, P. Lazzari, S. Salon, and A. Teruzzi. MERSEA deliverable D11.2.1.3 - Final report on the BFM OGS-OPA Transport module, 21 pp., 2008.

[4] A. Cheptsov: Enabling grid-driven supercomputing for oceanographic applications – theory and deployment of a hybrid OpenMP+MPI parallel model for the OPATM-BFM application. Proceedings of the HPC-Europa project's Transnational Access Meeting, Montpellier, October 14-16, 2009

[5] J. Seward, N. Nethercote, J. Weidendorfer and the Valgrind Development Team. Valgrind 3.3 - Advanced Debugging and Profiling for GNU/Linux applications. http://www.network-theory.co.uk/valgrind/manual/

[6] F. Wolf: Performance Tools for Petascale Systems. inSiDE, Vol. 7, No. 2, pp. 38-39 (2009)

[7] A. Knüpfer, H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler, M.S. Müller, and W.E. Nagel. The Vampir Performance Analysis Tool-Set. Tools for High Performance Computing, Springer, 2008, 139-156.