Optimal utilization of heterogeneous resources for biomolecular simulations

Scott S. Hampton, Sadaf R. Alam, Paul S. Crozier and Pratul K. Agarwal

Abstract—Biomolecular simulations have traditionally benefited from increases in the processor clock speed and coarse-grain inter-node parallelism on large-scale clusters. With stagnating clock frequencies, the evolutionary path for performance of microprocessors is maintained by virtue of core multiplication. Graphical processing units (GPUs) offer revolutionary performance potential at the cost of increased programming complexity. Furthermore, it has been extremely challenging to effectively utilize heterogeneous resources (host processor and GPU cores) for scientific simulations, as underlying systems, programming models and tools are continually evolving. In this paper, we present a parametric study demonstrating approaches to exploit resources of heterogeneous systems to reduce time-to-solution of a production-level application for biological simulations. By overlapping and pipelining computation and communication, we observe up to 10-fold application acceleration in multi-core and multi-GPU environments illustrating significant performance improvements over code acceleration approaches, where the host-to-accelerator ratio is static, and is constrained by a given algorithmic implementation.

Index Terms— Biomolecular simulations, performance tuning, multi-core, heterogeneous processors, hybrid multi-core

I. INTRODUCTION

Application developers have been adapting to the tradeoffs of higher floating-point (FP) computation capabilities and increasing parallelism offered by the traditional, distributed-memory architectures introduced over the past few decades [10, 27]. Recently, however, the availability of multi-core processors has introduced an additional level of parallelism, and consequently complex memory hierarchies, into the design considerations of parallel applications, while clock frequencies remain static or are even declining.

S. S. Hampton is with the Oak Ridge National Laboratory, Oak Ridge, TN 37830. E-mail: [email protected].

S. R. Alam is with the Swiss National Supercomputing Center, Manno, Switzerland. E-mail: [email protected].

P. S. Crozier is with Sandia National Laboratories, Albuquerque, NM 87185. E-mail: [email protected]. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed-Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC-94AL85000.

P. K. Agarwal is with the Oak Ridge National Laboratory, Oak Ridge, TN 37830. E-mail: [email protected].


Weighing the tradeoffs of familiar programming approaches against the prospect of substantial performance gains, application development teams targeting grand challenge problems can now exploit much higher levels of FP capability in the form of heterogeneous processors, such as Graphical Processing Unit (GPU) devices. These devices offer tremendous performance potential compared to mainstream, homogeneous multi-core processors at the expense of more complex memory and programming models. Specifically, the GPU devices targeted for scientific computing (for example, NVIDIA’s Tesla systems) offer more device cores and higher memory capacity than their visualization counterparts. The Compute Unified Device Architecture (CUDA) is the associated programming environment that has enabled the scientific code development community to harness the potential of these devices [2, 5, 28].

An increasing number of applications continue to report success in utilizing GPUs to improve code performance [14, 26, 34]. The reported speedups are typically based on the use of a single GPU, or perhaps a few GPUs, in a workstation environment. Even though GPUs are ubiquitously available, and GPUs specifically tailored for scientific applications (with increased FP capacity, more memory and higher bandwidth) have been around for a few years, proof-of-concept, production-level application implementations on GPU devices in workstation and cluster configurations have not been adopted by the wider end-user community. This can be partially explained by the observation that most of these accelerated implementations impose several restrictions in terms of problem sizes and system configurations. Often the gains from GPUs come at the cost of excluding multi-core processor parallelism. Unlike the acceleration of standalone kernels, where new programming models can be explored extensively, large-scale applications that have been developed and used by a wide community over decades have several data-structure and control dependencies that in some cases prohibit, and in most cases limit, the introduction of emerging programming concepts and optimization techniques.

The current approach of selective and rather restrictive utilization of resources is, however, expected to pose greater challenges as the number of cores available on host CPUs continues to grow. Moreover, future high performance computing (HPC) systems are likely to include some heterogeneous resources, as demonstrated by the first petaflops system, Roadrunner [20].

For applications developed for such hybrid systems, the end user will need to decide between taking the performance gains from CPU-cores or from GPU devices (and other heterogeneous resources). Our hypothesis, however, is that the benefit of heterogeneous architectures will come not only from improvements in time-to-solution from a single set of resources, but from the combination of most (or even all) resources. This requires application engineering and parameter tuning, in which the end user can be guided by experimental data from a range of multi-core processors in combination with GPU devices operating in an HPC environment.

In this paper, we describe a systematic study of biomolecular simulations with the GPU-enabled molecular dynamics (MD) software package LAMMPS in multi-node, multi-core and multi-GPU environments. LAMMPS has a wide user community and contains routines applicable to performing simulations in biology, chemistry, and materials science [3, 30]. To enable the simulation of longer time-scales, we accelerate two methods implemented in LAMMPS: (1) the particle-particle-particle-mesh (PPPM) method, which is similar to other particle mesh Ewald (PME) methods [17]; and (2) a cut-off based method, in which only the interactions within a pre-set radius are computed. The parameters used for the study presented in this paper include:

• algorithm type (PME or cut-off)—determines the amount of off-loaded computation
• system size (number of atoms)—defines the host, GPU and communication workload
• number of processor cores per node—host CPUs with a shared memory hierarchy
• number of GPU devices per node—connected directly to the host over a PCIe connection
• number of nodes—of a GPU-enabled cluster
• single versus double-precision on GPU—comparison of the Tesla 10-series and its successor, the 20-series codenamed Fermi

Our results demonstrate that a systematic parametric study allows system architects to design systems in a cost-effective manner and enables end users to effectively utilize the resources of heterogeneous systems. We assert that our tuning methodology enables an end user to efficiently target a workstation or a large cluster with GPU devices, depending on the size of a given problem and the targeted algorithm. We also present early results from a double-precision FP-enabled GPU device called Fermi [4]. We compare results of the Tesla 20-series, codenamed Fermi, against its predecessor, the Tesla 10-series (including the C1060 and S1070), highlighting the effects of significantly improved native double-precision FP support on the time-to-solution as well as on the accuracy of the biomolecular simulations.

The paper is organized as follows: background and motivation, along with related work, are outlined in Section II; the implementation methodology is detailed in Section III; results and discussion are provided in Section IV; and conclusions and future directions are given in Section V.

II. BACKGROUND AND MOTIVATION

Atomistic biomolecular MD simulations are computationally intensive: each step involves billions of arithmetic operations modeling the interactions between thousands of particles, and these steps must be repeated millions of times or more. For a typical biomolecular system, it is possible to generate MD trajectories of several nanoseconds per day using a workstation and upwards of 10-40 nanoseconds per day with access to large supercomputers. However, the desired time scales for the biological activity of interest are typically in the range of microseconds to milliseconds, several orders of magnitude beyond what can be simulated in reasonable or realistic wall-clock times. Therefore, there is a critical need for a revolutionary approach to performance optimization of MD codes and improvement in the wall-clock time needed to generate the required time scales, that is to say the time-to-solution. Alternate computer hardware technologies have attracted considerable interest for improving the time-to-solution of MD simulations. A number of general-purpose and specialized hardware systems are being investigated [9, 11, 16, 18, 21-24, 31-33].
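To make the size of this gap concrete, a back-of-the-envelope estimate (assuming the 2 fs time-step used later in this paper and a representative rate of 10 ns/day) is:

\[
\frac{1\,\mu\text{s}}{2\,\text{fs/step}} = 5\times10^{8}\ \text{steps}, \qquad
\frac{1\,\mu\text{s}}{10\,\text{ns/day}} = 100\ \text{days}, \qquad
\frac{1\,\text{ms}}{10\,\text{ns/day}} \approx 274\ \text{years}.
\]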

GPUs are being actively pursued as a solution since they provide a tremendous increase in the available peak computing power. Originally developed for pixel manipulation, GPUs offer capabilities that enable scientific application acceleration, including far greater floating-point throughput and memory bandwidth than a traditional microprocessor. The large arithmetic capability comes from multiple parallel execution units, called thread processors (also known as GPU-cores). As compared to the few floating-point execution units on a CPU, a GPU device has hundreds of such execution units, with a hierarchy of device memory associated with them. The memory bandwidth available to the GPU-cores is also significantly higher; hence, data residing in the GPU memory is accessible to the GPU-cores at a very high bandwidth, which in turn helps sustain the computational processing advantage. The large arithmetic capability of GPUs provides a significant advantage for scientific applications. At present, there are 240 GPU-cores available in NVIDIA’s Tesla C1060 (the first generation of GPUs specifically designed for HPC applications), with a total theoretical capability of 936 × 10^9 floating-point operations per second. The successor of the Tesla 10-series, the Fermi HPC GPU device, attempts to overcome several barriers to wide adoption within the scientific community, namely through improved support for double-precision arithmetic, error-correcting memories and a cache. We target a prototype version of this device and present details and results in the subsequent sections of this paper.
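For reference, this peak figure can be derived from the device clock and the (up to) three single-precision operations that each GPU-core of that generation can retire per cycle (a dual-issued multiply-add plus a multiply); this is a standard derivation, not a vendor-quoted breakdown:

\[
240\ \text{cores} \times 1.3\ \text{GHz} \times 3\ \text{FP ops/cycle} \approx 936\times10^{9}\ \text{FLOP/s}.
\]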

Although the hardware is affordable, there are several challenges on the programming front in terms of a standard programming model across the range of accelerator devices.

Running applications on these GPUs requires significant changes to the application code structure. In the past, the adoption of GPUs was hindered by the lack of high-level language (Fortran, C, C++) support for applications, including MD. This has recently changed, as NVIDIA now provides a software framework known as CUDA, which gives users a high-level programming environment in which applications can be ported and optimized. Another, vendor-independent framework, OpenCL, is also available and mainly targets code development interfaces [6]. The purpose of CUDA is to hide the intricate details of the hardware and to provide a base for writing device-independent software. Several recent research and development efforts target various aspects of programming, performance tuning and load balancing on GPU-based heterogeneous systems, but none of these techniques has yet demonstrated an impact on production-level applications [1, 7, 12, 13, 15, 25].

Unfortunately for the end user of biomolecular simulation software, the details of the CUDA framework are still fairly complex. Therefore, the benefit of this hardware solution remains out of reach. The efficient use of GPU computational resources requires careful off-loading of the data to the GPU device, independent execution of the computations on the GPU-cores, and collection of the results. Currently, this capability is not automatically available to the end user at compilation time. It should be mentioned that some programming approaches can hide these explicit data movements; however, these high-level interfaces have yet to demonstrate a level of maturity such that their performance is comparable to manual code instrumentation and control [1, 7]. For maximum performance gains, each code must be carefully rewritten to take advantage of GPU capabilities. In a traditional programming environment, a subroutine will typically loop over a range of data and sequentially apply an operation to it. On the GPU, however, a subroutine or kernel must be written as if the code were being executed on many pieces of data simultaneously. This is a very different programming paradigm and requires careful restructuring of the code to decompose the problem domain so that it runs efficiently on the GPU. Before a kernel can be sent to the GPU for execution, it must first be determined how to separate the work into a large number of computational blocks. These blocks execute independently of each other and have no direct means of inter-communication. Furthermore, there is no guarantee on the order in which blocks will be executed; thus, inter-block dependencies are not possible.
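As a deliberately simplified illustration of this paradigm (a minimal sketch, not taken from the GPU-enabled LAMMPS code), a CUDA kernel expresses the per-atom work as the body executed by each thread, and the host chooses how many independent blocks of threads cover the data:

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread advances one atom's position from its
    // velocity. Blocks execute independently and cannot communicate.
    __global__ void integrate_positions(float3 *x, const float3 *v,
                                        float dt, int n_atoms)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global atom index
        if (i < n_atoms) {
            x[i].x += v[i].x * dt;
            x[i].y += v[i].y * dt;
            x[i].z += v[i].z * dt;
        }
    }

    // Host side: decompose the work into a grid of thread blocks, then launch.
    void launch_integrate(float3 *d_x, const float3 *d_v, float dt, int n_atoms)
    {
        const int threads_per_block = 256;
        const int blocks = (n_atoms + threads_per_block - 1) / threads_per_block;
        integrate_positions<<<blocks, threads_per_block>>>(d_x, d_v, dt, n_atoms);
    }

Each block here is an independent unit of work; any coupling between atoms must be handled within a block or across separate kernel launches.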

Although our interest is in GPU acceleration, several other solutions have emerged, including the use of specialized hardware [16, 31, 33] and reconfigurable computing devices such as Field-Programmable Gate Arrays (FPGAs) [9, 24]. GPU-enabled software is now fairly common, and many studies have shown the benefits and potential improvements [11, 18, 21, 22, 29, 32]. However, a common thread throughout these initial studies is the pairing of a single core with a GPU device. In this paper, we set out to investigate the performance of the entire node when paired with one or more GPUs.

III. METHODOLOGY AND SOFTWARE IMPLEMENTATION

For the purposes of this study, we chose the MD engine LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator), due to its performance characteristics and wide user base. LAMMPS contains routines applicable to performing simulations in biology, chemistry, and materials science. It supports the popular CHARMM and AMBER force fields for biomolecular simulations. The massively parallel scalability of LAMMPS, even up to billion-atom systems, is well documented [19], but it also performs very well on single workstations. LAMMPS is written in C++ and uses the Message Passing Interface (MPI) for inter-processor communication. LAMMPS is an open-source code, distributed under the terms of the GNU Public License (GPL), as are the modifications described in this manuscript.

Molecular dynamics is a computational tool for simulating the behavior of a system of particles over time by integrating Newton's equations of motion. Newton's second law relates the force F on an object to its mass m and acceleration a through F = ma, where the acceleration is the rate of change of the velocity. In classical biological simulations, the forces F are a combination of bonded and non-bonded terms. The work for calculating the bonded terms is proportional to the number of atoms, N. This includes bonds between pairs of atoms, angle bonds between triplets, and dihedral bonds among quadruplets. Calculating the non-bonded terms, the electrostatic and Lennard-Jones interactions, is proportional to N^2, since these require determining all-to-all interactions. Typically, there are good approximations that can compute the long-range forces in O(N log N) time or better with high accuracy. One such method implemented in LAMMPS is the particle-particle-particle-mesh (PPPM) method, which is similar to other particle mesh Ewald (PME) methods [17]. One downside of these methods is their dependency on fast Fourier transforms (FFTs). For large systems, the communication requirements of FFTs often limit the performance of the application. This is especially true when implementing these methods on GPUs. Certain applications may also benefit from the much simpler cut-off method. In this algorithm, only interactions within a pre-set radius are computed. Usually, a switching function is applied at the radius boundary to allow distant forces to go smoothly to 0. This is especially important for the stability of long simulations.
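For reference, a generic form of these non-bonded terms under a cut-off (shown here for illustration only; the exact switching function used by the CHARMM force field and LAMMPS differs in detail) is:

\[
E_{\text{nb}} \;=\; \sum_{\substack{i<j \\ r_{ij} < r_c}} S(r_{ij})
\left[\, 4\varepsilon_{ij}\!\left(\Big(\tfrac{\sigma_{ij}}{r_{ij}}\Big)^{12} - \Big(\tfrac{\sigma_{ij}}{r_{ij}}\Big)^{6}\right)
+ \frac{q_i q_j}{4\pi\epsilon_0\, r_{ij}} \right],
\qquad \mathbf{F}_i = -\nabla_i E_{\text{nb}},
\]

where S(r) switches smoothly from 1 to 0 between the inner and outer cut-off radii.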

Several recent efforts have been directed at porting and optimizing MD codes for GPUs [11, 18, 21-23, 32]. It is anticipated that the hardware-software solution that takes popular MD codes and accelerates them transparently for the end user, without making assumptions or placing special conditions, will make the greatest impact. This will motivate acceptance in the wider community, where the end users do not have to worry about the porting details of the code that they routinely use for their biophysical investigations.

Placing restrictions on the simulation conditions would also limit widespread use.

Our long-term goal is to improve the time-to-solution to enable simulation of longer MD trajectories by overcoming the scaling challenges and efficiently exploiting various hardware technologies [21]. Therefore, our approach consists of off-loading the computations onto the GPU while operating under the parallel environment (MPI); the rationale for this approach is to keep the benefits of parallelism at the microprocessor level and gain from the additional computational resources of the GPUs attached to the microprocessors. This approach improves the time-to-solution on single workstations with a multi-core/multi-GPU environment as well as on Linux clusters. Moreover, this approach will improve performance on next-generation supercomputers, as they are beginning to provide the additional computing power of GPUs associated with each node. Furthermore, this study provides insight into how GPU resources should be distributed such that maximum performance is achieved. To the best of our knowledge, no one has previously analyzed the benefits and limitations of full CPU-node usage with a limited number of GPUs, at least for molecular dynamics simulations.

As we have recently detailed in an earlier publication [21], our current implementation is based on matching the architectural features of the GPU with the simulation requirements. Fig. 1 depicts an overview of our software implementation approach. Although the arithmetic units (cores) on the GPU are very fast and available in large numbers, transferring data between the host CPU and the GPU memory is costly. From Amdahl's law, it is known that the biggest impact on performance comes from improving the portion of the algorithm where the majority of the computation is done. Thus, only the most compute-intensive calculations, the non-bonded calculations, which typically require more than 85% of the actual simulation run-time, are off-loaded to the GPU devices (see Table I). The bonded-term calculations are relatively inexpensive, requiring less than 5% of the simulation time, and are handled by the CPU.
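As a worked illustration of this reasoning (an idealized bound, not a measured result), Amdahl's law gives the overall speed-up when a fraction f of the run-time is off-loaded and accelerated by a factor s; with the cut-off breakdown of Table I (Pair and Neighbor times), the attainable speed-up is bounded even for an arbitrarily fast GPU:

\[
S = \frac{1}{(1-f) + f/s}, \qquad
S_{\max} = \lim_{s \to \infty} S = \frac{1}{1-f}, \qquad
f = 0.95 \;\Rightarrow\; S_{\max} = 20.
\]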

TABLE I
BREAKDOWN OF THE CUT-OFF AND PPPM METHODS

Cut-off based method      Time (s)    % of total
  Pair time*               5272.78      85.60
  Bond time                 276.11       4.50
  Neigh time*               573.30       9.30
  Comm. time                  6.66       0.10
  Output time                 0.08       0.00
  Other time                 33.47       0.50
  Total                    6162.40

PPPM method               Time (s)    % of total
  Pair time*               8772.08      81.60
  Bond time                 263.51       2.50
  Neigh time*               832.81       7.70
  Comm. time                  6.68       0.10
  Output time                 0.07       0.00
  Other time                181.79       1.70
  K-space time              689.19       6.40
  Total                   10746.10

The time (in seconds) and percentage breakdown for the various LAMMPS sub-routines (see Fig. 1 for details) are provided. These tests were performed under serial execution conditions (setup B in Table II) for a 320,000-atom system (rhodoX10; further details are provided in Section IV). See Section III.A for a description of the cut-off and PPPM methods. * = target sub-routines for GPU off-loading.

Fig. 1. Schematic overview of the computational off-loading in GPU-enabled LAMMPS with pipelining. (a) The computationally expensive portions of the code are off-loaded onto the attached GPU device. During each time-step (δt), the CPU version performs all computations on the host CPU-cores, while in the GPU-enabled version the non-bonded calculations, including the Lennard-Jones (LJ) and the electrostatic (EEL) interactions, are computed on the GPU devices and the results are returned to the host. Optimization of the performance includes overlapping the computations on the CPU and GPU. Note that for the PPPM method the real (direct) space part is computed on the GPU, while the Fourier (k-space) part is computed by the CPU. (b) Pipelining is used to off-load computations onto a single (or multiple) GPU device from multiple cores. (c) In a multi-node GPU cluster environment, MPI is used by LAMMPS for off-node communication, while computations are off-loaded onto the GPU devices with the pipelining mechanism.


A. PME (PPPM) and cut-off based methods as two alternate programming models

The PME method, as mentioned above, is commonly used in biomolecular simulations, while the cut-off based method is not routinely used for production-quality simulations. However, we consider these two methods as alternate models for a qualitative study of the CPU/GPU computational balance. This balance is relevant to the off-node computation/communication ratio and has scalability implications. For example, consider the calculation breakdown in Table I, which shows that for the cut-off based simulation, 95% of the computation is spent in calculations (Pair and Neighbor times) that are off-loaded to the GPU. Disregarding communication, the ratio of GPU to CPU work is nearly 20 in this case. However, for PPPM simulations, just under 90% of the work is done by the GPU, which results in approximately a 10-to-1 ratio. Note that Table I provides a representative breakdown; for smaller system sizes the Pair and Neigh times typically constitute between 80 and 90% of the run-time.
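The ratios quoted above follow directly from the off-loaded fraction f of the serial run-time (an illustrative calculation based on the Table I percentages, ignoring communication):

\[
\frac{\text{GPU work}}{\text{CPU work}} = \frac{f}{1-f}:\qquad
f \approx 0.95 \Rightarrow \frac{0.95}{0.05} = 19 \approx 20, \qquad
f \approx 0.90 \Rightarrow \frac{0.90}{0.10} = 9 \approx 10.
\]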

B. Computation off-loading in multi-core and multi-GPU environments

In a multi-processor environment, LAMMPS uses a spatial decomposition method to partition the simulation volume evenly among the processes. Atoms located within a section of the volume belong to that process for the duration of the current time-step. Other atoms that are close to the volume boundaries, and thus are likely to interact with the owned atoms, are copied and appear as ghost atoms to the process. That is, they are used to compute effects on the existing atoms, but are otherwise ignored. This type of division is ideal for accelerator-based computation in that it ensures that all information needed for the current time-step is locally available. Using this model, the most expensive calculations are sent to the GPU, and only the final results need to be retrieved; no additional communication is necessary. Fig. 1(a) shows the execution stream for the force and energy calculations during a typical time-step in LAMMPS, and Fig. 1(c) shows the distribution of work among MPI processes. [The results presented in this paper are based on the use of OpenMPI; however, our implementation is compatible with other MPI flavors as well.]

The best performance occurs when there is a one-to-one ratio between compute cores and GPU devices, due to the spatial decomposition technique used for parallelism in LAMMPS and the load-balancing across multiple CPU-cores. However, on typical systems there are more cores than there are GPUs. With the current Tesla cards and CUDA v2.3, a single GPU cannot execute more than one kernel simultaneously from independent processes. However, if given more than one kernel to execute, it will complete them sequentially in a FIFO (first-in first-out) manner; this is schematically represented in Fig. 1(b). If the CUDA runtime system receives multiple kernel calls simultaneously, it delays their execution until the current kernel is complete, and then proceeds with the next available one. Since each CPU core makes an asynchronous call and then returns to its own work, there is still room for application speedup despite the GPU being oversubscribed. This is especially true when the CPU has a larger portion of computation or communication, as with FFT-based methods. We note that the Fermi cards and CUDA v3.0 will be able to execute up to 4 kernels simultaneously; however, these calls must be invoked by the same instance of CUDA. This places some restrictions on algorithms that need to access a GPU device from more than 4 CPU-cores.

C. Concurrent execution on CPU and GPU

In Fig. 1(a), notice that there is a clear distinction between computing the short-range bonded terms and the long-range non-bonded terms. Since these two can be calculated independently of each other, it is logical to compute them simultaneously. The short-range terms are relatively inexpensive to compute, so they are left to the CPU, while the expensive long-range calculations are sent to the GPU. In the case of the PPPM (PME) method, as described above, the k-space portion of the computation (which utilizes FFTs) is performed by the CPU.

The concurrent execution on CPU and GPU is made possible by a feature of the CUDA programming environment that makes kernel calls asynchronous. The memory reservation and copying of dynamic variables needed for a kernel must happen before the call itself. Once the data is prepared and the call is made to the GPU, the CPU is free to return to other tasks. It is then the responsibility of the CPU to determine whether the kernel is complete by querying and, if necessary, waiting until it is. Currently, we do not perform any explicit load balancing, so either the CPU or the GPU may be idle while waiting on the other. We are investigating methods for improving upon this potential bottleneck.
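A minimal sketch of this pattern using the CUDA runtime API follows (illustrative only; the kernel body, names and structure are placeholders and are not the actual GPU-enabled LAMMPS code):

    #include <cuda_runtime.h>

    // Hypothetical placeholder kernel; the real non-bonded kernel accumulates
    // Lennard-Jones and electrostatic forces over the neighbors of each atom.
    __global__ void nonbonded_forces(const float4 *pos, float4 *force, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            force[i] = make_float4(0.f, 0.f, 0.f, 0.f);
    }

    // One time-step of the CPU/GPU overlap pattern described above. h_force
    // should be page-locked (cudaHostAlloc) for the copy to be truly asynchronous.
    void timestep(const float4 *d_pos, float4 *d_force, float4 *h_force, int n,
                  cudaStream_t stream)
    {
        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;

        // 1. Launch the expensive non-bonded work; the call returns immediately.
        nonbonded_forces<<<blocks, threads, 0, stream>>>(d_pos, d_force, n);

        // 2. Queue the asynchronous copy of the results back to the host.
        cudaMemcpyAsync(h_force, d_force, n * sizeof(float4),
                        cudaMemcpyDeviceToHost, stream);

        // 3. The CPU is now free to compute the bonded terms (and, for PPPM,
        //    the k-space/FFT part) while the GPU works.
        // compute_bonded_terms(...);  compute_kspace(...);

        // 4. Query for completion and, if necessary, wait before combining forces.
        if (cudaStreamQuery(stream) == cudaErrorNotReady)
            cudaStreamSynchronize(stream);
    }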

IV. RESULTS AND ANALYSIS

We investigated the impact of a number of parameters on the application time-to-solution, including: different hardware test configurations (Table II); number of GPUs; number of cores; and number of nodes. For the analysis, we monitored the time-to-solution for 6 biological test cases with a fixed number of time-steps (1000) and a 2 femtosecond (fs) time-step. The selection of time-to-solution as the performance metric reflects the fact that, in biomolecular simulations, the push from the community has been for longer time-scale simulations in a fixed wall-clock time. Additionally, modeling of larger systems (with more atoms) is also desired, to enable more accurate models of the complex machines that operate at the molecular level [8].

A. Target Systems and Test Cases

A variety of host system configurations were investigated in this study. As shown in Table II, the selected hardware and configurations are representative of the variety and combination of hardware systems currently available to the end-user community.

In order to demonstrate a wide range of performance markers for common biomolecular systems, we chose three different simulation sizes (from ~24K to 320K atoms). JAC (the Joint AMBER-CHARMM benchmark) is a popular MD benchmark used in the community, consisting of the enzyme dihydrofolate reductase solvated in water (23,558 atoms). The standard LAMMPS rhodo test case consists of the rhodopsin protein in an explicit lipid bilayer with explicit solvent molecules (32,000 atoms). Using the replicate feature of LAMMPS, rhodo was replicated 3 and 10 times to produce systems with 96,000 atoms (referred to as rhodoX3) and 320,000 atoms (referred to as rhodoX10). Each of these three systems was simulated using both the cut-off based method and the particle-particle-particle-mesh (PPPM) method for long-range electrostatics. Therefore, there are a total of 6 test cases used in this study: JAC/cut-off, JAC/PME, rhodoX3/cut-off, rhodoX3/PME, rhodoX10/cut-off and rhodoX10/PME.

The CHARMM force-field was used in all simulations, with the inner and outer cut-off radii set to 8 and 10 Å respectively. Note that longer cut-offs would have yielded better scaling for the cut-off based methods; however, conservative values were used here. These 6 test cases have been selected to represent the typical workloads of the commonly used biomolecular systems in the end-user community.

B. Pipelining, exploiting multiple GPUs in a multi-core environment

Our approach is built on the ability to exploit parallelism and concurrency at various levels of the hardware. The use of pipelining to off-load the non-bonded terms allows multiple CPU-cores (each core is assigned a single MPI task under LAMMPS) to utilize one or more GPUs. As depicted in Figs. 2 and 3, this allows a significant improvement in performance for both the cut-off and the PME (PPPM) method based simulations. As mentioned above, the motivation for investigating both the cut-off and the PME method is to examine two alternate programming models. The cut-off based method provides a workload characterized by the need for large amounts of FP arithmetic with limited information exchange between the concurrent elements (CPU-cores). For this type of workload, off-loading the non-bonded calculations onto the GPUs provides the best utilization of the resources. The PME method, on the other hand, requires the use of FFTs, which results in significant communication between the concurrent elements as well as a slightly smaller FP arithmetic requirement. Overall, the results show that a single workstation with a total of 8 CPU-cores (2 quad-core processors) and 1, 2 or 4 NVIDIA Tesla C1060 cards can provide about a 5-40 fold speed-up over serial execution. Comparison with other efforts for improving MD codes indicates that our approach can meet and exceed single-node performance [21].

The large FP arithmetic capacity of the GPUs makes a significant impact on algorithms where the computations can be off-loaded entirely (or mostly) onto the GPUs. As depicted in Fig. 2, the cut-off based method provides an example of this type of workload characteristic. For large workloads (rhodoX3 and rhodoX10), the addition of more GPUs to a single workstation continues to improve the time-to-solution. For a strong-scaling problem such as these tests, with a fixed amount of computation, the performance boost comes not only from the increased FP arithmetic capacity of multiple cards but also from reducing the time to transfer data to and from the GPUs due to the increased aggregate bandwidth (multiple PCIe connections to multiple GPU cards). However, it should also be noted that the performance gains also come from the increasing number of CPU-cores.

TABLE II
TEST HARDWARE CONFIGURATIONS INVESTIGATED IN THIS STUDY

Test setup                       A                     B                     C
Tesla GPU                        C1060                 C1060                 Fermi C2050 (A03)
GPU-cores                        240                   240                   448
Speed (MHz)                      1,300                 1,300                 1,150 (1,250^a)
Memory (MB)                      4,096                 4,096                 3,072
Peak B/W (GB/s)                  102                   102                   (172^a)
Peak gigaFLOPS (single/double)   933/78                933/78                (1040/520^a)
GPU devices                      1                     4^b                   1
Host CPU                         AMD Opteron 8356      Intel Xeon E5540      Intel Xeon E5520
CPU-cores                        16 (4 x quad core)    8^c (2 x quad core)   8^c (2 x quad core)
Speed (GHz)                      2.3                   2.53                  2.27
Memory/core (GB)                 4                     2                     12
OS/gcc version                   SL 5.0^d/4.1.2        RHEL 5.4^e/4.1.2      RHEL 5.4^e/4.1.2
CUDA version                     2.3                   2.3                   3.0
Nodes/Interconnect               24/Infiniband DDR     Single                Single

a = expected in general release; b = tests performed used 1, 2 or all 4 of these devices; c = hyper-threading capability was not used in this study; d = Scientific Linux (SL); e = Red Hat Enterprise Linux.

Smaller workloads, such as the JAC benchmark, provide an interesting case. Resource sharing of a single GPU device by multiple cores leads to poor performance as a function of the number of cores (see the red curve in Fig. 2(c)). This is possibly due to the inefficiencies introduced by dividing the same amount of off-loaded computation in the FIFO pipelining queue among an increasing number of cores.

For increasing workloads (Figs. 2(a) and 2(b)) the behavior is different, as the overhead associated with the FIFO pipelining queue has a smaller impact on the performance. Therefore, we observe that for smaller workloads it may be better to benefit from the GPU devices first rather than sharing the device among all CPU-cores. Overall, the addition of up to 4 GPUs consistently improves the time-to-solution while keeping the performance gains from the CPU-cores (i.e., the time of GPU-enabled runs is always better than that of CPU-only runs with the same number of cores).

The PME (PPPM) method off-loading indicates similar trends (see Fig. 3). Increasing workloads continue to benefit from the increasing number of cores, and the time-to-solution continues to decrease with the addition of multiple GPUs. The application speed-up of the GPU-enabled simulations, however, is lower than for the cut-off based method. This is a result of a relatively smaller amount of computation being off-loaded to the GPUs, as well as the different scaling characteristics of the algorithm owing to the increased communication between the CPU-cores for the FFT required by the k-space computation. In this case, splitting the same amount of work into the FIFO queue has much less impact on the overall scaling characteristic, even for smaller workloads (see Fig. 3(c)). This can be explained by the dependence of the overall application on the communication patterns between the MPI tasks and on the computations on the CPU host. Therefore, we observe that the computation off-loading characteristics and the benefits from GPUs may be more robust for algorithms that depend on communication between the cores. The inherent inefficiencies associated with CPU-to/from-GPU data transfer and the overheads associated with computation pipelining in the FIFO queues can be largely overlapped with other MPI-related communications. Eventually these costs may start to override the other communication costs and impact the overall performance (for example, as shown in Fig. 3(c), the JAC test on 8 cores with 1 Tesla C1060 card shows a poorer time-to-solution than 4 cores with 1 Tesla C1060 card).

C. Scaling in a GPU-enabled multi-node environment

The availability of inexpensive GPUs makes them a potential solution for increasing the available computing power by adding them to commodity Linux clusters. The question remains as to the impact of these devices on the overall performance. Concerns exist about the ability to benefit from these devices in large-scale runs where the application utilizes CPU-cores over a large number of nodes connected by a network.

Fig. 2. Improvement in time-to-solution with GPU-enabled LAMMPS for the cut-off based MD simulations. Tests were performed with Tesla C1060 GPUs and hardware configuration B (Table II); all tests were performed for 1000 steps with 2 fs time-step (time-to-solution for CPU-only version in the inset).

Fig. 3. Improvement in time-to-solution with GPU-enabled LAMMPS for the PPPM (PME) method based MD simulations. Tests were performed with Tesla C1060 GPUs and hardware configuration B (Table II); all tests were performed for 1000 steps with 2 fs time-step. The inset provides the time-to-solution for the CPU-only version.

Therefore, we investigated the scaling characteristics of our implementation with an increasing number of nodes. Note that this 24-node cluster has 1 Tesla C1060 card per node in conjunction with 4 quad-core processors (16 CPU-cores/node in total). Using the pipelining mechanism, we therefore show results for computations off-loaded from 1, 2, 4, 8 or 16 cores onto the single card on every node.

TABLE III
TIME-TO-SOLUTION FOR RHODOX3/CUT-OFF ON GPU CLUSTER

LAMMPS (CPU-only)
Nodes    1 c/n^a    2 c/n     4 c/n     8 c/n     16 c/n
 1       2594.5     1314.5    673.1     343.6     179.0
 2       1327.5      667.4    347.0     176.8      94.7
 4        667.6      340.2    177.2      92.7      54.3
 8        337.7      176.4     91.5      49.6      36.4
16        174.1       89.2     47.5      29.4      28.1*
24        117.6       60.5     34.8      26.3      33.0

GPU-enabled LAMMPS (1 C1060/node)
Nodes    1 c/n      2 c/n     4 c/n     8 c/n     16 c/n
 1        265.58    176.88    144.66    147.03    202.21
 2        126.22     90.31     81.71    105.14    157.66
 4         60.28     47.99     57.57     83.71    128.06
 8         32.94     31.63     45.09     66.99    117.14
16         20.30     24.35     35.06     60.40    106.06
24         15.6*     20.52     34.65     60.23    108.43

Time-to-solution (in seconds) based on 1000 time-steps of 2 fs for rhodoX3/cut-off simulations. a = cores/node; * = best time-to-solution.

TABLE IV
TIME-TO-SOLUTION FOR RHODOX10/PME ON GPU CLUSTER

LAMMPS (CPU-only)
Nodes    1 c/n^a    2 c/n     4 c/n     8 c/n     16 c/n
 1       15060.0    7586.9    3915.9    2007.8    1024.1
 2        7532.6    3927.6    1990.6    1052.9     580.5
 4        3920.3    1948.1    1028.9     559.2     302.4
 8        1956.0    1002.8     528.1     279.5     192.6
16         992.0     521.0     262.8     168.5     139.9*
24         673.8     335.0     188.7     145.1     214.5

GPU-enabled LAMMPS (1 C1060/node)
Nodes    1 c/n      2 c/n     4 c/n     8 c/n     16 c/n
 1        3005.5    1749.9    1191.4     825.0     890.6
 2        1304.5     817.9     544.8     515.1     480.6
 4         598.6     382.2     333.3     297.0     368.6
 8         297.9     213.2     180.1     202.0     311.7
16         167.3     126.7     118.8     176.5     311.1
24         111.1      89.2*    108.3     196.3     371.1

Time-to-solution (in seconds) based on 1000 time-steps of 2 fs for rhodoX10/PME simulations. a = cores/node; * = best time-to-solution.

Tables III & IV indicate that, for these biological problems in the multi-node cluster environment, 4-8 nodes with a single GPU device per node provide performance similar to that of 16-24 nodes with no GPU. As the emphasis of this study is on improving the time-to-solution, the wall-clock timings reported by LAMMPS are compared in Tables III & IV for different configurations. Even though this cluster has an Infiniband DDR interconnect, utilization of an increasing number of cores/node leads to significant communication and poor scaling. For CPU-only runs, the use of 16 cores/node leads to a worse time-to-solution, particularly at 24 nodes. When GPUs are used, the time-to-solution shows significant improvement, particularly at lower node counts and with a smaller number of cores/node. This is due to the scaling characteristics of LAMMPS on Infiniband at high core counts. Note that the best time-to-solution for CPU-only runs (indicated by * in Tables III & IV) is worse than that of several GPU-enabled simulations at lower total core counts. Fig. 4 indicates that our implementation is able to realize performance gains from a single GPU device per node. For all node counts, utilizing 1 to 8 cores per node continued to provide benefit. However, for the fixed simulation size the use of all 16 cores starts to show an application slowdown. This is similar to the observation for single-node performance (see Fig. 3) and is due to the scaling characteristics of the application. Based on the results from a single node (Figs. 2 and 3), it can be envisioned that the performance will improve with an increasing number of GPU devices available per node. Therefore, we observe that for small clusters (1) it may be beneficial to use 1 or 2 CPU-cores with GPUs per node, and (2) given the application scaling characteristics (Figs. 2 and 3), more GPUs/node will provide better performance than a larger node count with a high core count per node. However, a better network may also help overcome the scaling limitations posed by high CPU-core counts.

D. NVIDIA's Fermi provides significant improvement in double-precision performance

One of the shortcomings noted for the Tesla 10-series GPUs is their relatively poor double-precision performance. The upcoming generation of GPUs optimized for scientific applications, the 20-series Fermi, addresses this problem. Biomolecular simulations, like a large number of other scientific applications, require double-precision arithmetic for accurate and numerically stable results. As shown in Fig. 5, the use of mixed precision (double-precision on the CPU and single-precision on the GPU) provides qualitatively similar results. This test was performed under constant particle number (N), volume (V) and energy (E) conditions, i.e., the NVE ensemble. Energy conservation is a commonly used criterion for judging the algorithmic and numerical stability of MD simulations. In Fig. 5, the total system energy (kinetic + potential energy) is compared for CPU-only and GPU computations. The results show that even though single-precision on GPUs shows qualitative agreement with the CPU-only results, there are large variations in the energy due to the difference in arithmetic. The use of double-precision on the GPU significantly improves the agreement with the CPU-only results.

Fig. 4. Application speed-up observed on the 24-node GPU cluster. (a) rhodoX3/cut-off and (b) rhodoX10/PME were used as representative systems. The cluster configuration is described in Table II (setup A). The cluster nodes are interconnected by Infiniband DDR.

One of the attractive features of the Fermi cards is that they are expected to provide a substantial improvement over the current-generation Tesla cards (more than a 6-fold improvement in double-precision capability, as shown in Table II). We tested our software implementation's ability to exploit the increased double-precision FP arithmetic. As depicted in Fig. 6, the Fermi C2050 (A03) card shows a significant improvement in the time-to-solution. Note that these tests were performed on an early hardware version (A03) with courtesy access provided by the vendor. This hardware version has 448 CUDA-cores, compared to the 240 available in the C1060. For the three test cases (with the PPPM method), the time-to-solution shows significant improvement at the workstation level. The scaling behavior is similar to that of a workstation with a single C1060 device. Note that there was only one device, shared by up to 8 cores through the pipelining mechanism. Due to the similarity with the performance of the C1060, it is expected that as more C2050 devices are added to the workstation, further performance gains will be achieved (see Figs. 2 and 3). This is particularly expected to be the case for larger systems (with an increased number of atoms).

Fig. 5. Simulation accuracy based on single versus double-precision on GPUs. This test was performed with the JAC test case, the PPPM (PME) method, and a time-step of 2 fs. Hardware configuration B in Table II was used.

Fig. 6. Improvement in double-precision time-to-solution with Fermi. Hardware configuration B in Table II was used for the Tesla 10-series as well as CPU-only tests, and configuration C for Fermi (Tesla 20-series) tests. For double-precision on the GPU, v3.0 of the CUDA driver was used.

In the near future, it is expected that Tesla 20-series cards with increased memory capacity will be available to the end-user community. As Table II indicates, the improvement in single-precision capacity of the 20-series Fermi over the Tesla 10-series is small; the double-precision capacity, however, is expected to be about 6 times that of the current 10-series cards. Based on the results presented here, we observe that the double-precision performance of our implementation on the C2050 is similar to the single-precision performance of the C1060 cards (see Figs. 3 & 6). Therefore, we project that GPU clusters with Tesla 20-series cards running in double-precision could provide scaling characteristics similar to those seen in Fig. 4, leading to a 5-10 fold speed-up over CPU-only simulations for typical system sizes.

V. CONCLUSIONS AND FUTURE DIRECTIONS

Next-generation computing resources are expected to deviate from the conventional path of more concurrency in a homogeneous environment. The hybrid systems, which are already beginning to take shape, will have a multi-level hierarchy of concurrency built from heterogeneous resources. In addition to the traditional paradigm of coupling processors (with a few multi-core CPUs each) through a network, an increasing number of sockets and multi-core processors will significantly increase the concurrency even within a single node. Further, the use of non-traditional FP arithmetic accelerators such as GPUs will introduce heterogeneity into this hierarchy.

In order to harness the huge potential of these emerging hybrid computing resources, code and algorithm developers need to target complex and rather restrictive programming models. But, as we have demonstrated here, algorithm and application implementations can be parameterized to enable the end users of scientific applications to control and optimally exploit the resources at hand, or to assist them in planning future system procurements. Our implementation methodology shows how this can be achieved even within a production-level application framework and for problem sizes that are of interest to the end-user community. The systematic study and results validate our tuning approach, as we demonstrate reduced time-to-solution over multiple system configurations and problem sizes. The key contributions of this paper are therefore: (1) a detailed parametric study with no inherent restriction on the number of host CPU and GPU cores; (2) the identification of critical factors influencing performance on hybrid workstations and clusters; (3) the observation that overlapping both computations and communications from heterogeneous resources can improve performance; and (4) validation of improved double-precision results on an early-access Fermi compared to its predecessor, the Tesla 10-series HPC GPU device.

A number of research and development challenges have been identified throughout this paper. In order of priority, we plan on targeting the overlapping and load balancing over the multiple CPU and GPU cores once multiple kernel execution is supported. We also plan on studying the feasibility of thread-based approaches such as pthreads or OpenMP for this task. We will investigate performance on larger heterogeneous systems. One of our long-term goals is to formulate a performance model such that end users could explore the design parameter space without having empirical data readily available to them.

ACKNOWLEDGMENTS

We would like to thank Duncan Poole, Peng Wang, and Steve Harpster of NVIDIA for their technical assistance and early access to the Fermi (A03) card. We also thank Ricky Kendall and Chris Fuson of NCCS for assistance with Lens and other resources. Financial support for this work was provided by NIH (R21GM083946) and ORNL’s LDRD fund.

REFERENCES

[1] CAPS HMPP Workbench. http://www.caps-entreprise.com
[2] GPGPU Developer Resources. http://gpgpu.org/developer
[3] LAMMPS Molecular Dynamics Simulator. http://lammps.sandia.gov/
[4] NVIDIA Fermi Architecture White Paper. Available from http://www.nvidia.com/object/fermi_architecture.html
[5] NVIDIA GPU Computing Developer Home Page. http://developer.nvidia.com/object/gpucomputing.html
[6] OpenCL—The open standard for parallel programming of heterogeneous systems. URL: http://www.khronos.org/opencl/
[7] PGI Accelerator Compilers. http://www.pgroup.com/resources/accel.htm
[8] P. K. Agarwal and S. R. Alam. Biomolecular Simulations on Petascale: Promises and Challenges. J. Physics: Conference Series (2006) 46, 327-333.

[9] S. R. Alam, et al. Using FPGA devices to accelerate biomolecular simulations. Computer, vol. 40, pp. 66-73, Mar 2007.

[10] S. R. Alam, et al. Cray XT4: an early evaluation for petascale scientific simulation. Proc. ACM/IEEE Supercomputing, 2007.

[11] J. A. Anderson, et al. General purpose molecular dynamics simulations fully implemented on graphics processing units. J. Comp. Physics, vol. 227, May 2008.

[12] S. S. Baghsorkhi, et al. An Adaptive Performance Modeling Tool for GPU Architectures. 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010).

[13] D. Cederman and P. Tsigas. A Practical Quicksort Algorithm for Graphics Processors. Proc. 16th Annual European Symposium on Algorithms (ESA 2008), Lecture Notes in Computer Science Vol.: 5193, Springer-Verlag 2008.

[14] A. Cevahir, et al. Fast Conjugate Gradients with Multiple GPUs. Int. Conf. on Computational Science (ICCS) 2009, Lecture Notes in Computer Science, Springer Berlin/Heidelberg, pp. 893-903.

[15] J. Choi, et al. Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs. 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010)

[16] E. Chow, et al. Desmond Performance on a Cluster of Multicore Processors. DE Shaw Technical Report 2008. (URL: http://www.deshawresearch.com/publications.html)

[17] T. Darden, et al. Particle mesh Ewald: An n*log(n) method for Ewald sums in large systems. J. Chem. Phys. vol. 98, 1993.

[18] M. S. Friedrichs, et al. Accelerating Molecular Dynamic Simulation on Graphics Processing Units. J. Comp. Chemistry, vol. 30, Apr 2009.

[19] J. N. Glosli, et al. Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability. Proc. ACM/IEEE Supercomputing Conference, 2007.

[20] D. Grice, et al. Breaking the Petaflops Barrier. IBM J. Res. Dev., vol.53, no. 5, 2009.

[21] S. S. Hampton, et al. Towards Microsecond Biological Molecular Dynamics Simulations on Hybrid Processors, Int. Conf. on High Performance Computing & Simulation (HPCS), 2010.

[22] D. J. Hardy, et al. Multilevel summation of electrostatic potentials using graphics processing units. Parallel Computing, vol. 35, Mar 2009.

[23] M. J. Harvey and G. De Fabritiis. An Implementation of the Smooth Particle Mesh Ewald Method on GPU Hardware. J. Chemical Theory and Computation, vol. 5, Sep 2009.

[24] M. C. Herbordt, et al. Computing Models for FPGA-Based Accelerators. Computing in Science & Engineering, vol. 10, Nov-Dec 2008.

[25] W. Hwu et al. Implicitly parallel programming models for thousand-core microprocessors. Proc. Design Automation Conference, 2007

[26] B. Jang, et al. Multi GPU implementation of iterative tomographic reconstruction algorithms. Proc. Sixth IEEE Int. Symposium on Biomedical Imaging: From Nano To Macro (Boston, Massachusetts, USA, June 28 - July 1, 2009). IEEE Press, Piscataway, NJ, pp. 185-188.

[27] IBM journal of Research and Development staff. Overview of the IBM Blue Gene/P project. IBM J. Res. Dev., vol. 52, no.1/2, 2008.

[28] E. Lindholm, et al. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, vol. 28, Mar-Apr 2008.

[29] J. D. Owens, et al. GPU computing. Proceedings of the IEEE, vol. 96, May 2008.

[30] S. Plimpton. Fast Parallel Algorithms for Short-Range Molecular-Dynamics. J. Comp. Physics, vol. 117, Mar 1995.

[31] D. E. Shaw, et al. Anton, a special-purpose machine for molecular dynamics simulation. Communications of the ACM, vol. 51, pp. 91-97, Jul 2008.

[32] J. E. Stone, et al. Accelerating molecular modeling applications with graphics processors. J. Comp. Chemistry, vol. 28, Dec 2007.

[33] J. V. Sumanth et al. Performance and cost effectiveness of a cluster of workstations and MD-GRAPE 2 for MD simulations. Proc. Symposium on Parallel and Distributed Computing, 2003.

[34] R. Yokota, et al. Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence. Computer Physics Communications, vol. 180, pp. 2066-2078, 2009.