



Developing New Parallelization Techniques for Emerging HPC Architectures and Life Science Applications

Hai Tao Sun, Michael Yu Sun IBM Corporation, Building 30-2, 3605 Highway N, Rochester, MN 55901 USA

Email: [email protected]

Abstract  Parallelizing chemistry and drug discovery codes to leverage emerging High-Performance Computing (HPC) architectures is a difficult task. However, it is required in order to simulate biologically important molecular systems that are not accessible with current technology. In addition to parallelizing existing chemistry and drug discovery codes, it is important to explore, at the same time, new methodologies that address limitations in the methods currently used for molecular simulations. In this study we combine speed and accuracy by parallelizing the promising new explicit polarization (X-Pol) method, which addresses several limitations in today's methodologies. The X-Pol method, also called the X-Pol potential, treats both bonded and nonbonded interactions using electronic structure theory. Bonded interactions are treated by an iterative self-consistent field method, and nonbonded interactions are treated by electronic embedding. In this approach partial charges are obtained by applying electronic structure methods to the individual fragments. When the method is applied to a protein, the fragments are defined as peptide units. In our research we intend to use the peptide fragments, as defined in the X-Pol method, as the basic units for parallelization. The peptide fragments are mapped onto an HPC hybrid architecture in which shared-memory processors are interconnected by a customized or standard network. In our method a pre-selected number of fragments are parallelized via shared-memory algorithms, while message-passing algorithms are used across processors for neighboring fragments. Keywords: Massively parallel computing, HPC architecture, life science, Blue Gene, Cyclops, supercomputer.

1. Introduction

It is well known that parallel computing is a powerful tool to reduce the time for large simulations. All supercomputers today rely on collections of highly interconnected microprocessors. This is not surprising: as the clock speed of a single processor approaches its physical limitations, it becomes more difficult to improve performance simply by increasing the clock speed of a single microprocessor. In fact, some massively parallel machines have lowered the clock speed to accommodate several million processors. The use of parallel computers in quantum chemistry was introduced by Clementi and co-workers in the 1980s [1]. Their computer systems, IBM 3090 and IBM 4341, consisted of loosely coupled arrays of processors; that was the leading-edge technology at that time.

There are five sections in this paper. Section 1 is an introduction, including background and specific aims. Section 2 introduces IBM supercomputers with emerging architectures. Section 3 describes the research design and methods. Section 4 presents the discussion and conclusions, and Section 5 highlights future work.

1.1 Background

Molecular simulations are very important not only in academia but also in industry. Current technology used for drug discovery relies on methods based on approximations that do not always describe the interactions properly. Enabling more accurate methodologies with parallel programming techniques may provide a quantum leap in the simulations that can be done in drug discovery. Furthermore, parallel computation is becoming more and more important in molecular simulations, because the computing power and speed of sequential computation do not sufficiently meet the requirements of quantum mechanical calculations. Emerging supercomputer architectures are essential to provide the next generation of scientific simulations. However, the challenge for application developers is to leverage these new architectures. This includes rewriting application programs or developing new algorithms and techniques that can improve the efficiency of computing applications in various fields, such as molecular simulations.

Some early work involved enabling the parallel version of Gaussian 94, a computational chemistry software program, on a distributed-memory massively parallel supercomputer [2]. This work included analysis of the scalability of methods such as configuration interaction with single excitations (CIS), Møller-Plesset perturbation theory up to second order (MP2), multi-configuration self-consistent field (MCSCF) calculations, the Hartree-Fock method, and density functional theory (DFT) [2]. This work was extended to include parallelization on shared-memory machines. OpenMP is a standard for parallel programming on shared-memory computers. The first OpenMP implementation of Gaussian 98, a computational chemistry software program, was reported in Ref. [3]. It is beyond the scope of this paper to review all the work that has been done in parallel computing in quantum chemistry, but there are excellent reviews available [4].


The kind of new methodology employed here combines classical molecular dynamics with electronic structure methods [5, 6, 7]. This methodology has helped to overcome some of the limitations inherent in molecular mechanics force fields. The specific method to be employed is the explicit polarization (X-Pol) method developed by Gao and Truhlar [8, 9]. The X-Pol method is an electronic structure-based quantum mechanical theory and simulation method designed as a next-generation force field [8]. It provides a novel method for computing bonded interactions using electronic structure theory. The X-Pol method simulates non-bonded interactions by an embedding approach reminiscent of molecular mechanical methods [8, 9].

There are three reasons for pursuing this study. First, computer programs based on molecular mechanics (MM) and quantum mechanics (QM) tend to be CPU intensive. This limits the size of the molecular systems that can be simulated. Parallelization of this type of application is currently the most practical way to carry out calculations on biologically relevant systems. Second, parallelizing applications such as the explicit polarization (X-Pol) method on new HPC supercomputers benefits the scientific community by developing new algorithms that can leverage these architectures and may be applied to other similar or more complex problems. Finally, this research will create an opportunity to explore HPC architectures and will identify hardware features that are important when designing these systems. The next generation of exascale computers represents a significant challenge not only for hardware design but also for application developers.

1.2 Specific aims

Specifically, in this research we propose to accomplish the X-Pol potential parallelization by providing three contributions. First, we propose to develop new parallel algorithms for the X-Pol method to distribute workloads from multiple fragments to multiple processors. Second, we propose to carry out the parallelization of the X-Pol potential by exploiting hardware architecture, operating system, compiler and parallel programming models. Third, we propose to divide the parallelization into multiple steps. Our first step will include developing a hybrid parallel approach based on two standards, OpenMP [10] and MPI [11]. In subsequent steps, we will exploit other parallel models such as Partitioned Global Address Space (PGAS) [12] and OpenCL [13].
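As a concrete, minimal illustration of the first step, the short C skeleton below shows the general shape of a hybrid program in which MPI is initialized with thread support and OpenMP threads run inside each MPI process. It is only a sketch of the programming-model combination, not part of the X-Pol code.

/* Minimal hybrid MPI+OpenMP skeleton (illustration of the programming-model
 * combination only, not X-Pol code). Compile, e.g.: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* Request thread support so that OpenMP regions may run inside a rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Each MPI rank (node) spawns a team of OpenMP threads. */
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

In the intended parallelization, the MPI ranks would hold groups of X-Pol fragments and the OpenMP threads would share the work within each group, as described in Section 3.3.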

This research work will be tailored to leverage new petaFLOPS supercomputer architectures. We will also explore requirements for future exascale supercomputers. All of this will be applicable to other programs that are based on methodologies similar to the X-Pol potential.

The main challenge of this research is to determine how to leverage the new HPC architecture and development environment. Parallelizing applications to extreme numbers of processors is not a trivial task, and it requires new algorithms to properly distribute all the work and keep thousands of processors busy. Balancing communication and computation requires substantial research.
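To make the work-distribution problem concrete, one very simple strategy, shown below purely for illustration and not as the scheme adopted in this work, is a greedy heuristic that assigns each fragment to the currently least-loaded node using a per-fragment cost estimate (for example, the number of atoms or basis functions in the fragment). The fragment costs used here are made-up numbers.

/* Greedy load-balancing sketch (an illustrative heuristic, not the X-Pol
 * scheme): assign each fragment to the node with the smallest accumulated
 * cost. cost[f] is a hypothetical per-fragment work estimate. */
#include <stdio.h>

#define NFRAG 8
#define NNODE 3

int main(void)
{
    double cost[NFRAG] = {4.0, 1.0, 3.0, 2.0, 5.0, 1.0, 2.0, 3.0};
    double load[NNODE] = {0.0};
    int owner[NFRAG];

    for (int f = 0; f < NFRAG; f++) {
        int best = 0;
        for (int n = 1; n < NNODE; n++)
            if (load[n] < load[best]) best = n;   /* least-loaded node */
        owner[f] = best;
        load[best] += cost[f];
    }

    for (int f = 0; f < NFRAG; f++)
        printf("fragment %d -> node %d\n", f, owner[f]);
    return 0;
}

Sorting the fragments by decreasing cost before the loop (the classical longest-processing-time heuristic) usually improves the balance; a realistic scheme would also have to account for the communication between neighboring fragments, which is exactly the balancing question raised above.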

2. IBM Supercomputers with Emerging New HPC Architectures

IBM has developed several models of supercomputers with new HPC architectures, including the Blue Gene supercomputer. The operation speed of the Blue Gene supercomputers ranges from 360 teraFLOPS to 20 petaFLOPS [14]. The Blue Gene supercomputer has several advantages compared to other supercomputers [14, 15, 16]. For example, it improves the efficiency of parallel systems by optimizing communication between processors and threads, it uses high-performance networks for synchronization and communication, and it is built on standards such as OpenMP and MPI. Four different types of Blue Gene supercomputers have been developed or are currently being developed: Blue Gene/C, Blue Gene/L, Blue Gene/P, and Blue Gene/Q [14, 15, 16, 17, 18].

The Cyclops supercomputer, or Blue Gene/C, was developed by IBM and has many unique features [19, 20, 21, 22]. The whole Cyclops system includes 96 racks, 288 midplanes, 13,824 Cyclops blades, and 1.1 million C64 processors, and the operation speed of the whole system is 1.1 petaFLOPS. Cyclops is a specialized supercomputer designed to target applications in protein folding, biological structure, and optimization [19, 20, 21, 22]. It has an advanced memory design and communication system [19]. Each Cyclops blade has 80 processors, and each processor has two hardware threads, so each blade has a total of 160 hardware threads. These processors and threads are connected by a 96×96 crossbar switch, through which a message can be delivered from one processor to another. The 80 processors and 160 threads, together with the crossbar switch, provide a unique opportunity and environment for parallel computation inside a Cyclops blade. A message from one processor to another processor located in a different rack can be delivered through a high-speed link. This feature makes it possible to implement distributed parallel computing efficiently [19, 20, 21, 22].

Blue Gene/L and Blue Gene/P are different from Blue Gene/C. Blue Gene/L has up to 64 racks with a peak speed of 360 teraFLOPS [14, 15, 23]. Each rack has 32 computer boards and each computer board has 16 computer cards. Each computer card has two BG/L chips, and each BG/L chip has two embedded PowerPC 440 processors, two L2 caches, and four megabytes of eDRAM. Blue Gene/P is the second generation of the Blue Gene supercomputer [17]. It is designed to run continuously at one petaFLOPS [16, 17]. If a user needs more computing power, Blue Gene/P can be configured to reach speeds of three petaFLOPS [15, 16, 17]. In Blue Gene/P, each BG/P chip has four embedded PowerPC 450 processors, four L2 caches, and eight megabytes of eDRAM. The software development environment of Blue Gene/L and Blue Gene/P includes a high-performance kernel, a parallel file system, and system management [14, 15]. The compilers of Blue Gene/L and Blue Gene/P support C, C++, FORTRAN, MPI, and OpenMP [14, 15, 24].

3. Research Design and Methods

3.1 Combined Quantum Mechanical and Molecular Mechanics

Molecular dynamics simulations based on force fields have proven to be successful when computing conformational changes of large biochemical systems. However, these methods have multiple deficiencies, such as:

(1). The selection of the energy terms and degrees of freedom is somewhat arbitrary when developing a force field [8].

(2). Most force field calculations are based on the harmonic approximation to bond stretching and angle bending with cosine functions for internal rotations, often with no consideration of the coupling between different degrees of freedom [8].

(3). The treatment of many-body polarization effects is challenging, because it is difficult to choose the functional form and empirical parameters [8], and the computational cost is higher than that of simpler fixed-charge models.

(4). Charge transfer effects are not considered in molecular mechanics force fields, and there is not an easy way to include these effects in force field calculations [8].

(5). The form of the empirical potentials used in most cases cannot simulate chemical reactions involving bond formation and bond breaking or regions significantly away from the minimum-energy structure of the electronically adiabatic ground state [8].

The X-Pol potential is a novel method for potential energy calculations [5, 6, 7, 8, 9]. It provides a next-generation force field and overcomes the deficiencies of current force fields. It is implemented using semi-empirical or ab initio electronic structure theory [5, 6, 7, 8, 9]. The X-Pol potential divides a molecular system into fragments, i.e., peptide units [8]. The separation is made at the Cα atoms of adjacent peptide units; after the separation, these Cα atoms are equally shared between the neighboring fragments and form the boundary between the QM and MM regions. The interactions at this boundary are calculated by the generalized hybrid orbital (GHO) method, a combined quantum mechanical and molecular mechanical (QM/MM) technique [5, 6, 7, 8, 9]. The atoms within a peptide unit, except the shared Cα atoms, belong to the QM region, and their interactions are calculated with Austin Model 1 (AM1), a semi-empirical method; the calculations inside the QM region are based on electronic structure theory [8]. The interactions between one peptide unit and another are also calculated using the GHO method. Basing the theory on fragments makes the X-Pol method very attractive because it can be made to scale linearly with system size [8, 9].
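To make the fragment-based structure of the method concrete, the following C sketch outlines the kind of double self-consistent iteration that the description above implies: each fragment is solved in the field of the partial charges of the other fragments, the charges are refreshed, and the cycle repeats until the total energy stops changing. The helper routines here are trivial stand-ins, not real electronic structure code, and the sketch is illustrative only; it is not the X-Pol implementation.

/* Sketch of a fragment-based double-SCF loop (illustrative only; the
 * helper functions are trivial stand-ins for per-fragment QM calculations). */
#include <math.h>
#include <stdio.h>

#define NFRAG 4
#define TOL   1.0e-8

/* Stand-in: "solve" the SCF of fragment f embedded in the charges of the
 * other fragments and return a fake energy that depends on those charges. */
static double fragment_scf(int f, const double charges[NFRAG])
{
    double field = 0.0;
    for (int g = 0; g < NFRAG; g++)
        if (g != f) field += charges[g];
    return -1.0 - 0.1 * field;          /* fake embedded fragment energy */
}

/* Stand-in: update the partial charge of fragment f from its "density". */
static void update_charges(int f, double charges[NFRAG])
{
    charges[f] = 0.5 * charges[f];      /* fake charge relaxation */
}

int main(void)
{
    double charges[NFRAG] = {0.1, -0.1, 0.2, -0.2};
    double e_old, e_total = 0.0;
    do {
        e_old = e_total;
        e_total = 0.0;
        for (int f = 0; f < NFRAG; f++) {
            e_total += fragment_scf(f, charges);  /* inner, per-fragment SCF */
            update_charges(f, charges);           /* refresh embedding field */
        }
    } while (fabs(e_total - e_old) > TOL);        /* outer SCF convergence */
    printf("converged total energy (fake units): %f\n", e_total);
    return 0;
}

Because the expensive inner step is largely independent for each fragment within an outer cycle, the fragments are natural units to distribute across processors, which is the basis of the parallelization design in Section 3.3.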

3.2 Parallel Computing Models

Parallel computing is divided into hardware parallel computing and software parallel computing. Hardware parallel computing, also called implicitly parallel computing, is designed and implemented inside the computer hardware, for example in pipelines, multiple threads, multiple cores, multiple processors, and multiple memory banks. Software parallel computing lives in the application layer and the operating system layer and is implemented with different parallel languages, such as OpenMP, MPI, and OpenCL.

Software parallel computation is becoming more and more important, because trends in hardware design indicate that implicitly parallel architectures may not meet future computing-performance requirements [25]. In software parallel computation, multiple calculations are carried out simultaneously [25]. The main purpose of parallel computation is to increase computing power, reduce computing cost, and improve application performance [25, 26].

The OpenMP (Open Multi-Processing) programming model is a shared-memory programming model, as shown in Figure 1. It supports C, C++, and FORTRAN in a shared-memory system connected through fast switches.

Figure 1. Shared-memory programming model [OpenMP] [27]

The MPI (Message Passing Interface) is a distributed-memory programming model, as shown in Figure 2. The MPI interfaces provide message-passing topology, synchronization, and communication commands between a set of processors. It supports C, C++, and FORTRAN in distributed systems connected through fast switches and networks.

Figure 2. Distributed-memory programming model [MPI] [27]

The PGAS (Partitioned Global Address Space) model is a newer parallel programming model in which the memory is divided into two categories, i.e., global address space and partitioned address space [28]. PGAS models enhance performance by exposing data or thread locality, and the performance of PGAS is better than that of two-sided MPI [28].

3.3 Design of Parallel X-Pol Software

Our initial parallelization design relies on the fact that X-Pol divides a polypeptide of N residues into N subunits or fragments. Since there is an interaction between the fragments, we will determine the optimal number of fragments that can be clustered together. Once this is done, via message passing, we will distribute each cluster (a group of fragments) to a different compute node, i.e., a communication endpoint running one instance of the operating system and executing the computation. Each compute node will perform the SCF procedure in parallel using OpenMP.

The host machine, or front end of the supercomputer, assigns a node (a communication endpoint) as a control node. The control node is a general-purpose node that helps accelerate sequential code segments or workloads. The control node sorts fragments into groups and then distributes these fragment groups to different compute nodes. After receiving fragments, a compute node allocates them to processors if the compute node has multiple processors. Multiple compute nodes then conduct the energy computations simultaneously. During the computation, a leading compute node needs to communicate with a partner compute node to exchange energy information. The partner compute node should be in the same row and near the leading compute node in order to save communication time. High-speed links need to be set up by the host machine before running tasks. After finishing all fragment energy computations, the control node gathers all results from the different processors through the compute nodes and returns the calculated energy and force to CHARMM (Chemistry at Harvard Molecular Mechanics), as shown in Figure 3; a minimal sketch of this workflow is given after the figure.

Figure 3: Parallel Design and Implementation of X-Pol. Fragment images courtesy of Xie and Gao, J. Chem. Theory Comput. 2007, 3, 1890-1904.
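The sketch below illustrates this control-node/compute-node pattern, assuming a hypothetical fragment_group_energy routine and a trivial round-robin assignment of fragment groups to nodes; it is not the production X-Pol/CHARMM interface. Rank 0 plays the role of the control node, each rank evaluates its groups with OpenMP threads, and the partial energies are reduced back to the control node.

/* Sketch of the control-node / compute-node workflow (illustrative only).
 * Compile, e.g.: mpicc -fopenmp xpol_design_sketch.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NGROUP 32   /* hypothetical number of fragment groups */

/* Stand-in for the SCF energy of one group of fragments. */
static double fragment_group_energy(int g)
{
    return -1.0 - 0.01 * (double)g;
}

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The control node's sorting of fragments into groups is represented
     * here by a trivial round-robin ownership rule. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local) schedule(dynamic)
    for (int g = 0; g < NGROUP; g++) {
        if (g % size == rank)                   /* groups owned by this node */
            local += fragment_group_energy(g);  /* threads share the groups  */
    }

    /* Gather the partial energies on the control node (rank 0), which would
     * then return the energy and forces to CHARMM. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total energy over %d fragment groups: %f\n", NGROUP, total);

    MPI_Finalize();
    return 0;
}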

4. Discussion and Conclusions

In parallel processing, the total calculation time is divided into two components: the calculation time on each processor and the time required for transferring data between processors [25]. Speed-up relates the run-time on one processor to the run-time on multiple processors; it is defined as the ratio of the run-time on one processor to the run-time on multiple processors [25]. The number of molecules in an input file depends on the nature of the system; for example, the water dimer input file has two molecules, whereas a hydrated protein is a large molecule (the protein) surrounded by hundreds of water molecules.
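Written explicitly, with T(1) the run-time on one processor and T(p) the run-time on p processors, the speed-up and the corresponding parallel efficiency are

S(p) = T(1) / T(p),        E(p) = S(p) / p.

These are the standard definitions from Ref. [25] and are the natural metrics for comparing the parallel strategies reported below.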

As the number of processors changes, the performance of the simulation will change [18, 19, 20, 29]. Fragments can be grouped together and distributed to different nodes. The number of fragments in each group impacts performance [8, 19, 20, 26].

There are six atoms, nine atom pairs, two residues, four angles, and four bonds in the water dimer input file. The water dimer input file has been run with the sequential X-Pol program, and the gprof profiling tool has been used to profile the run. The profiling data of the sequential X-Pol program are shown in Table 1.

Table 1: Program Profile Data (each sample counts as 0.01 seconds)

% time   cumulative time (s)   self time (s)   function name
39.78        619.06               619.06       qm2_mat_diag_xpol_
37.87       1208.29               589.23       xpol_energy_iter_
 4.51       1278.54                70.25       construct_fock_2c2e_frag_
 4.30       1345.45                66.91       construct_hmmdensity_
 3.82       1404.95                59.50       construct_hmmcharge_
 2.25       1440.03                35.08       qm2_densit_xpol_
 1.65       1465.66                25.63       qmcore_mmcharge_
 0.92       1479.94                14.28       construct_fock_1c2e_frag_
 0.56       1488.70                 8.76       calculate_qmmm_1e_integrals_samm_
 0.52       1496.77                 8.07       construct_fock_2c2e_
 0.48       1504.28                 7.51       transform_density_
 0.48       1511.77                 7.49       qm2_deriv_qmmm_heavy_xpol_
 0.34       1517.07                 5.30       qm2_cnvg_xpol_
 0.33       1522.21                 5.14       transform_fock_
 0.29       1526.80                 4.59       __intel_new_memcpy
 0.27       1531.06                 4.26       qm2_get_qmmm_forces_xpol_
 0.21       1534.40                 3.34       setupqmlist_
 0.16       1536.92                 2.52       nbonda_
 0.16       1539.38                 2.46       construct_fock_1c2e_
 0.14       1541.57                 2.19       bondtype_
 0.13       1543.60                 2.03       exp.L
 0.12       1545.53                 1.93       enbfs8_
 0.12       1547.36                 1.83       for_check_mult_overflow64
 0.07       1548.51                 1.15       vbftn_xpol_
 0.06       1549.38                 0.87       __intel_new_memset
 0.05       1550.15                 0.77       xyzpbound_
 0.04       1550.84                 0.69       _int_free
 0.04       1551.51                 0.67       _int_malloc
 0.04       1552.12                 0.61       Malloc
 0.02       1552.47                 0.35       Cfree
 0.02       1552.79                 0.32       for_allocate
 0.02       1553.09                 0.30       for_cpstr
 0.02       1553.39                 0.30       qm2_deriv_qm_analyt_xpol_
 0.02       1553.67                 0.28       sYSTRIm
 0.02       1553.91                 0.24       for_dealloc_allocatable
 0.01       1554.14                 0.23       qm2_deriv_coulomb_xpol_
 0.01       1554.32                 0.18       _intel_fast_memcmp
 0.01       1554.49                 0.17       __write_nocancel
 0.01       1554.61                 0.12       _intel_fast_memcpy
 0.01       1554.73                 0.12       munmap
 0.01       1554.84                 0.11       cread_
 0.01       1554.95                 0.11       exp

The profiling data of the sequential X-Pol program indicate that some functions take much more time to execute than others. The two functions that take the most CPU time are qm2_mat_diag_xpol (the diagonalization function) and xpol_energy_iter (the energy calculation function): qm2_mat_diag_xpol accounts for 39.78% of the CPU time and xpol_energy_iter for 37.87%. The cost of qm2_mat_diag_xpol is O(n^3) and the cost of xpol_energy_iter is O(n^2). These two functions are the bottlenecks of the energy and force calculation, and they are therefore selected as the initial targets of parallelization.

An initial MPI X-Pol program has been compiled and run with five parallel strategies and one sequential strategy for the water dimer input file. The parallel strategies are: one node with two processors, one node with four processors, one node with eight processors, two nodes with four processors, and two nodes with eight processors. The calculated energies of the biological system are the same for all five parallel strategies and the sequential run: the total calculated energy is -120.73064 kcal/mol in every case. However, the computing efficiency differs among the strategies. For example, the total calculation times for one node with two, four, and eight processors are 0.29 seconds, 0.05 seconds, and 0.73 seconds, respectively, while the total calculation times for two nodes with four and eight processors are 0.50 seconds and 0.28 seconds, respectively.

5. Future Work

First, we propose to test the scalability by evaluating the performance with one processor and with 256, 512, 1024, 2048, 8192, and 16384 processors on IBM supercomputers. Second, we propose to analyze I/O performance by testing NFS (Network File System) versus GPFS (General Parallel File System); the performance of the simulation may differ for different I/O types. Third, we propose to develop a special program to optimize the number of fragments in each cluster, and then to run the parallel X-Pol code on several supercomputers with different architectures and designs.

References

1. E. Clementi, G. Corongiu, J. Dietrich, S. Chin, and L. Domingo, "Parallelism in Quantum Chemistry: Hydrogen bond study in DNA base pairs as an example", Int. J. Quantum Chemistry, 26, 18, 601-618 (1984).
2. C. P. Sosa, J. Ochterski, J. Carpenter, and M. J. Frisch, "Ab Initio Quantum Chemistry on the Cray T3E Massively Parallel Supercomputer", J. Comp. Chem., 19, 1053-1063 (1998).
3. C. P. Sosa, G. Scalmani, R. Gomperts, and M. J. Frisch, "Ab initio quantum chemistry on a ccNUMA architecture using OpenMP", Parallel Computing, 26, 843-856 (2000).
4. C. L. Janssen and I. M. B. Nielsen, Parallel Computing in Quantum Chemistry, CRC Press, Taylor & Francis Group, NY (2008).
5. J. Gao, "A molecular-orbital derived polarization potential for liquid water", J. Chem. Phys., 109, 6, 2346 (1998).
6. J. Gao, "Toward a Molecular Orbital Derived Empirical Potential for Liquid Simulations", J. Phys. Chem. B, 101, 657-663 (1997).
7. K. Morokuma and K. Kitaura, "Energy Decomposition Analysis of Molecular Interactions", Institute for Molecular Science, Myodaiji, Okazaki, 444, Japan (1988).
8. W. Xie and J. Gao, "Design of a Next Generation Force Field: The X-Pol Potential", J. Chem. Theory Comput., 3, 1890-1900 (2007).
9. W. Xie, L. Song, D. G. Truhlar, and J. Gao, "The variational explicit polarization potential and analytical first derivative of energy: Towards a next generation force field", J. Chem. Phys., 128, 234108 (2008).
10. http://openmp.org/wp/ (accessed on July 22, 2011).
11. http://www.mpi-forum.org (accessed on July 22, 2011).
12. http://en.wikipedia.org/wiki/Partitioned_global_address_space (accessed on July 22, 2011).
13. http://www.khronos.org/opencl/ (accessed on July 22, 2011).
14. C. Sosa, "IBM System Blue Gene Solution: Blue Gene/P Application Development", http://www.redbooks.ibm.com/abstracts/sg247287.html (2010).
15. C. Sosa, "Unfolding the IBM eServer Blue Gene Solution", http://www.redbooks.ibm.com/abstracts/sg246686.html (accessed on July 22, 2010).
16. IBM, "Blue Gene Project Update", http://www.research.ibm.com/bluegene/BG_External_Presentation_January_2002.pdf (accessed on July 22, 2011).
17. V. Salapura, "Next Generation Supercomputers", http://community.anitaborg.org/wiki/images/9/92/GHC07-BlueGene_salapura.pdf (accessed on July 22, 2011).
18. F. Allen, G. Almasi, W. Andreoni, D. Beece, B. J. Berne, A. Bright, J. Brunheroto, and C. Cascaval, "Blue Gene: A vision for protein science using a petaflop supercomputer", http://www.research.ibm.com/journal/sj/402/allen.pdf (accessed on July 22, 2011).
19. J. D. Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, "FAST: A Functionally Accurate Simulation Toolset for the Cyclops64 Cellular Architecture", Proceedings of the First Annual Workshop on Modeling, Benchmarking, and Simulation (2005).
20. 64-Bit Cyclops Principles of Operation, Part 1, Part 2, and Part 3, IBM Corporation (2010).
21. G. S. Almasi, C. Cascaval, J. G. Castanos, M. Denneau, D. Donath, M. Eleftheriou, M. Giampapa, H. Ho, D. Lieber, J. E. Moreira, D. Newns, M. Snir, and H. S. Warren, Jr., "Dissecting Cyclops: A detailed analysis of a multithreaded architecture", http://www.research.ibm.com/people/c/cascaval/medea02.pdf (accessed on October 8, 2011).
22. G. R. Gao, "Toward a Scalable Programming Model for High-End Computer System with Many-Core Chip Technology", HPCweek-04-2008 (2008).
23. G. S. Almasi, C. Cascaval, J. G. Castanos, M. Denneau, D. Donath, M. Eleftheriou, M. Giampapa, H. Ho, D. Lieber, J. E. Moreira, D. Newns, M. Snir, and H. S. Warren, Jr., "Demonstrating the scalability of a molecular dynamics application on a Petaflops computer", Proceedings of the 2001 International Conference on Supercomputing, 393-406 (2001).
24. "Blue Gene", http://en.wikipedia.org/wiki/Blue_Gene (accessed on July 20, 2010).
25. A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, second edition, The Benjamin/Cummings Publishing Company, Inc., ISBN-13: 978-0-201-64865-2 (2003).
26. J. Hein, F. Reid, L. Smith, L. Bush, M. Guest, and P. Sherwood, "On the performance of molecular dynamics applications on current high-end systems", Phil. Trans. R. Soc. A, doi:10.1098/rsta.1624 (2005).
27. A meeting note from Dr. Carlos J. Sosa.
28. K. Yelick, "Performance and Productivity Opportunities using Global Address Space Programming Models", http://www.sdsc.edu/pmac/workshops/geo2006/pubs/Yelick.pdf (2006).
29. J. Philips, G. Zheng, S. Kumar, and L. Kale, "NAMD biomolecular simulation on thousands of processors", Proceedings of the IEEE/ACM SC2002 Conference, p. 36, Los Alamitos, CA, doi:10.1109/SC.2002.10019 (2002).