
Casper: Process-based asynchronous progress

Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA

Min Si [1][2], Antonio J. Peña [1], Jeff Hammond [3], Pavan Balaji [1], Yutaka Ishikawa [4]
[1] Argonne National Laboratory, USA ({msi, apenya, balaji}@mcs.anl.gov)
[2] University of Tokyo, Japan
[3] Intel Labs, USA
[4] RIKEN AICS, Japan

Large Chemical & Biological Applications
- NWChem: quantum chemistry application
- SWAP-Assembler: bioinformatics application
- Molecular dynamics: simulation of the physical movements of atoms and molecules for materials science, chemistry, and biology


Application Characteristics

- Large memory requirement (cannot fit in a single node)
- Irregular data movement

NWChem [1]

[1] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong, "NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations," Comput. Phys. Commun. 181, 1477 (2010).

Example molecules: water cluster (H2O)21, pyrene C16H10, carbon C20

High-performance computational chemistry application suite, composed of many types of simulation capabilities:
- Molecular electronic structure
- Quantum mechanics/molecular mechanics
- Pseudopotential plane-wave electronic structure
- Molecular dynamics

Communication Runtime
- Global Arrays [2]: abstractions for distributed arrays; a global address space physically distributed across processes and hidden from the user
- ARMCI [3]: the communication interface for RMA underneath Global Arrays
- ARMCI native ports (Cray, IB, DMAPP): limited platforms, and a long development cycle time to support a new platform
- ARMCI-MPI: portable implementation on top of MPI RMA; supports most platforms (Cray, IB, Tianhe-2, K computer)

[2] http://hpc.pnl.gov/globalarrays
[3] http://hpc.pnl.gov/armci

Get-Compute-Update
Typical get-compute-update mode in GA programming: GET block a, GET block b, perform DGEMM in a local buffer, and ACCUMULATE block c. (A minimal MPI RMA sketch of this loop follows the outline below.)

Pseudocode:
  for i in I blocks:
    for j in J blocks:
      for k in K blocks:
        GET block a from A
        GET block b from B
        c += a * b    /* computing */
      end do
      ACC block c to C
    end do
  end do

Outline
- Problem Statement
- Solution
- Evaluation
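The same loop can be expressed directly on top of MPI RMA, which is what ARMCI-MPI ultimately issues beneath Global Arrays. The following is a minimal, self-contained sketch rather than NWChem code: the tile size, the one-tile-per-rank window layout, and the owner() mapping are assumptions made only for illustration.

```c
/* Minimal get-compute-update sketch over MPI RMA (not NWChem/GA code).
 * Assumptions for illustration: each rank's window for A, B, and C holds
 * one BLOCK x BLOCK tile of doubles at displacement 0, and tile (i,j)
 * lives on rank owner(i,j). */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 64
#define NB    (BLOCK * BLOCK)

static int owner(int i, int j, int nprocs) { return (i + j) % nprocs; }

static void get_compute_update(MPI_Win win_a, MPI_Win win_b, MPI_Win win_c,
                               int nblk, int nprocs)
{
    double *a = malloc(NB * sizeof *a);
    double *b = malloc(NB * sizeof *b);
    double *c = malloc(NB * sizeof *c);

    MPI_Win_lock_all(0, win_a);
    MPI_Win_lock_all(0, win_b);
    MPI_Win_lock_all(0, win_c);

    for (int i = 0; i < nblk; i++) {
        for (int j = 0; j < nblk; j++) {
            for (int x = 0; x < NB; x++) c[x] = 0.0;
            for (int k = 0; k < nblk; k++) {
                /* GET block a from A and block b from B (one-sided reads),
                 * then complete them locally before computing. */
                MPI_Get(a, NB, MPI_DOUBLE, owner(i, k, nprocs), 0, NB,
                        MPI_DOUBLE, win_a);
                MPI_Get(b, NB, MPI_DOUBLE, owner(k, j, nprocs), 0, NB,
                        MPI_DOUBLE, win_b);
                MPI_Win_flush_all(win_a);
                MPI_Win_flush_all(win_b);

                /* c += a * b: local DGEMM, written as a naive triple loop */
                for (int r = 0; r < BLOCK; r++)
                    for (int t = 0; t < BLOCK; t++)
                        for (int s = 0; s < BLOCK; s++)
                            c[r * BLOCK + s] += a[r * BLOCK + t] * b[t * BLOCK + s];
            }
            /* ACC block c to C (one-sided atomic update) */
            MPI_Accumulate(c, NB, MPI_DOUBLE, owner(i, j, nprocs), 0, NB,
                           MPI_DOUBLE, MPI_SUM, win_c);
            MPI_Win_flush_all(win_c);
        }
    }

    MPI_Win_unlock_all(win_a);
    MPI_Win_unlock_all(win_b);
    MPI_Win_unlock_all(win_c);
    free(a); free(b); free(c);
}
```

On implementations that execute these operations in software, the flushes above only complete once the target rank happens to enter the MPI library, which is exactly the asynchronous progress problem examined next.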

Experimental Environment

- NERSC's* newest supercomputer: Cray XC30
- 2.57 petaflops peak performance
- 133,824 compute cores
*National Energy Research Scientific Computing Center

NWChem CCSD(T) Simulation
- "Gold standard" CCSD(T): a Pareto-optimal point of high accuracy relative to computational cost, at the top of the three-tiered pyramid of methods used for ab initio calculations.
- Internal steps in a CCSD(T) task: self-consistent field (SCF), four-index transformation (4-index), CCSD iteration, and the (T) portion.
- CCSD(T) internal steps for varying water problems: the (T) portion consistently dominates the entire cost by close to 80%.

[Figure: the three-tiered method pyramid, from bottom to top SCF O(N^3), MP2 O(N^5), CCSD(T) O(N^7); accuracy and computation increase toward the top. "We are here" marks CCSD(T).]

CCSD(T) is the "gold standard" quantum chemistry method, which means that it provides very high accuracy relative to computational cost, and in particular, it is a Pareto optimal point.

This means that, to get more accurate than CCSD(T), one needs substantially more computation, and if one chooses a less accurate method, the loss in accuracy is greater than the savings in computation.

In any case, CCSD(T) is the most widely used high-accuracy method in quantum chemistry, and it is at the top of the three-tiered pyramid of methods used for ab initio (quantum mechanical) calculations.

Most chemistry studies use a combination of these, with CCSD(T) applied judiciously to ensure the accuracy of the less expensive methods.

As for what CCSD(T) is, the CCSD part is the iterative evaluation of the coupled-cluster wavefunction with singles and doubles (or singly- and doubly-excited clusters).

The (T) is an a posteriori error correction based upon the perturbative approximation of triples (or triply-excited clusters), complete through 4th order, with inclusion of a subset of 5th-order terms determined empirically to be necessary by Pople and coworkers.

The (T) computation is non-iterative, at least in a canonical formulation (which is what we are doing).

How to Determine Scalability?
- Parallel efficiency: the execution time on the minimal number of cores serves as the base T1. But what if the base execution itself is not efficient, i.e., suffers from inefficient communication?
- Computational efficiency: focuses on the overhead of inefficient communication; the computation time on the minimal number of cores serves as the base Tcomp.
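Written out, and assuming the base run uses the minimal core count N_min with total time T1 and computation-only time Tcomp, while T(N) is the measured time on N cores, one natural formalization of the two metrics (my notation, which may differ from the paper's exact definitions) is:

```latex
% Parallel efficiency uses the full base time T_1, so a communication-heavy
% base run inflates PE(N); computational efficiency is anchored to T_comp.
\[
  PE(N) = \frac{N_{\min}\, T_{1}}{N \, T(N)}, \qquad
  CE(N) = \frac{N_{\min}\, T_{\mathrm{comp}}}{N \, T(N)}
\]
```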


Is the (T) Portion Efficient?

[Figure: (T) portion strong scaling for W21, showing parallel efficiency and computational efficiency. A high base T yields an artificially high PE(N); the computational efficiency shows that the (T) portion is not efficient.]

Why Is the (T) Portion Not Efficient? ((T) portion profiling for W21)
- One-sided operations are not truly one-sided: on most platforms, some operations (e.g., 3D accumulates of double-precision data) still have to be done in software, and Cray MPI (in its default mode) implements all operations in software.
- A software implementation of one-sided operations means that the target process has to make an MPI call for the communication to make progress, which causes extreme communication delay in the computation-intensive (T) portion for large-scale problems (see the sketch after the outline below).

Challenge: how do we improve asynchronous progress in communication with minimal impact on computation?

Outline
- Problem Statement
- Solution
- Evaluation
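To make the progress problem concrete, here is a minimal two-rank sketch (not taken from NWChem or the paper; the buffer size and sleep time are arbitrary). Rank 1 computes without calling MPI while rank 0 accumulates into its window; with an MPI library that implements accumulate in software and no asynchronous progress, the flush on rank 0 stalls until rank 1 next enters the MPI library (run the ranks on separate nodes to avoid shared-memory fast paths).

```c
/* Illustration of stalled RMA progress when the target does not call MPI. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, size;
    double *base, local[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* needs two ranks */

    MPI_Win_allocate(N * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Win_lock_all(0, win);

    if (rank == 0) {
        for (int i = 0; i < N; i++) local[i] = 1.0;
        double t0 = MPI_Wtime();
        /* Software-implemented accumulate: completion needs the target
         * (rank 1) to drive the MPI progress engine. */
        MPI_Accumulate(local, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE,
                       MPI_SUM, win);
        MPI_Win_flush(1, win);
        printf("accumulate + flush took %.3f s\n", MPI_Wtime() - t0);
    } else if (rank == 1) {
        sleep(5);   /* long computation phase with no MPI calls */
    }

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```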

Traditional Approaches to ASYNC Progress
- Thread-based approach: every MPI process has a dedicated background communication thread that polls for MPI progress (a generic sketch of this approach follows below).
- Interrupt-based approach: assume all hardware resources are busy with user computation on the target processes, and use hardware interrupts to awaken a kernel thread.

[Figure: processes P0-P3, each paired with a background thread T0-T3.]

Cons of the thread-based approach: wastes 50% of the computing cores or oversubscribes cores, and incurs the overhead of MPI multithreading safety.
Cons of the interrupt-based approach: overhead of frequent interrupts (DMAPP-based ASYNC overhead on Cray XC30).
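For contrast with Casper, here is a generic sketch of the thread-based approach described above. It is a textbook illustration, not the "Thread (O)/(D)" implementation evaluated later; the MPI_THREAD_MULTIPLE requirement and the MPI_Iprobe polling loop are the assumptions.

```c
/* Generic thread-based asynchronous progress: each process spawns a
 * background thread that keeps poking the MPI progress engine so that
 * software-implemented RMA targeting this process advances even while
 * the main thread only computes. */
#include <mpi.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int progress_done;

static void *progress_loop(void *arg)
{
    (void)arg;
    int flag;
    MPI_Status status;
    while (!atomic_load(&progress_done)) {
        /* Any MPI call drives progress; a nonblocking probe is a cheap one. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t tid;

    /* Multithreaded MPI is required, which is itself a source of overhead
     * (the multithreading-safety cost listed in the cons above). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    pthread_create(&tid, NULL, progress_loop, NULL);

    /* ... application computation and RMA communication go here ... */

    atomic_store(&progress_done, 1);
    pthread_join(tid, NULL);
    MPI_Finalize();
    return 0;
}
```

The busy-polling thread either steals a core from computation (the dedicated-core style, halving the COMP cores) or shares one with it (the oversubscribed style), which is the tradeoff quantified in the evaluation.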

Our Solution: Process-based ASYNC Progress
- Multi- and many-core architectures: the number of cores is growing rapidly, and not all of the cores are always kept busy.
- Casper [4]: dedicates an arbitrary number of cores to ghost processes; a ghost process intercepts all RMA operations targeted at the user processes. (With enough cores to dedicate, no interrupts are needed.)

[Figure: original communication (Process 0's RMA data is delayed until Process 1's next MPI call) versus communication with Casper (a ghost process applies the accumulate on behalf of the busy user process); user processes P0..PN, ghost processes G0, G1.]

- No multithreading or interrupt overhead
- Flexible core deployment
- Portable PMPI redirection (a simplified sketch follows the outline below)

[4] M. Si, A. J. Peña, J. Hammond, P. Balaji, M. Takagi, and Y. Ishikawa, "Casper: An asynchronous progress model for MPI RMA on many-core architectures," in Parallel and Distributed Processing (IPDPS), 2015.

Outline
- Problem Statement
- Solution
- Evaluation
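Casper stays transparent to the application by interposing on MPI calls through the standard PMPI profiling interface. The sketch below illustrates only that interception mechanism; the ghost-rank choice and the unchanged displacement are made-up placeholders and do not reflect Casper's real mapping, which routes operations to per-node ghost processes that can access the user window memory [4].

```c
/* Simplified PMPI interception sketch: the library defines MPI_Accumulate
 * itself and forwards to PMPI_Accumulate after redirecting the operation to
 * a ghost process. The translation below is a placeholder, not Casper's. */
#include <mpi.h>

/* Placeholder: pretend the highest rank in the window's group serves as the
 * ghost for every target and that displacements carry over unchanged. */
static int ghost_rank_of(int target_rank, MPI_Win win)
{
    MPI_Group group;
    int size;
    (void)target_rank;
    PMPI_Win_get_group(win, &group);
    PMPI_Group_size(group, &size);
    PMPI_Group_free(&group);
    return size - 1;
}

int MPI_Accumulate(const void *origin_addr, int origin_count,
                   MPI_Datatype origin_datatype, int target_rank,
                   MPI_Aint target_disp, int target_count,
                   MPI_Datatype target_datatype, MPI_Op op, MPI_Win win)
{
    int ghost = ghost_rank_of(target_rank, win);
    /* The ghost process stays inside MPI, so the operation makes progress
     * without interrupting the busy user process. */
    return PMPI_Accumulate(origin_addr, origin_count, origin_datatype,
                           ghost, target_disp, target_count,
                           target_datatype, op, win);
}
```

In Casper itself, window creation calls are intercepted as well so that ghost processes can legally operate on the user processes' window memory; see [4] for the actual design.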

Experimental Environment

- Two 12-core Intel Ivy Bridge processors (24 cores) per node
- Cray MPI v6.3.1

[DEMO] Core Utilization in CCSD(T) Simulation

[Figure: core utilization (%) split into COMM and COMP, plus task processing (%), for four configurations: original MPI (no ASYNC), Casper ASYNC, threads ASYNC on dedicated cores, and threads ASYNC on oversubscribed cores. Oversubscribed ASYNC cores spend their time polling MPI progress; Casper shows concurrent COMM and COMP. Higher computation utilization is better. Note: the full demo for W21 would take too many core hours.]

Strong Scaling of the (T) Portion for the W21 Problem

Core deployment per 24-core node:
                                        # COMP   # ASYNC
  Original MPI                            24        0
  Casper                                  23        1
  Thread (O) (oversubscribed cores)       24       24
  Thread (D) (dedicated cores)            12       12

[Figure: execution time (reduced) and computational efficiency (improved) for each core deployment.]

Why Do We Achieve ~100% Efficiency?

[Figure: CCSD(T) simulation processing of the computation-intensive (T) portion for (H2O)21, W21 using 1704 cores and W21 using 6144 cores.]

Core deployment per 24-core node:
                                        # COMP   # ASYNC
  Original MPI                            24        0
  Casper                                  23        1     (loses only 1 of 24 COMP cores, 4%)
  Thread (O) (oversubscribed cores)       24       24     (core oversubscription)
  Thread (D) (dedicated cores)            12       12     (loses 50% of COMP cores)

Summary
NWChem CCSD(T) with Casper ASYNC:
- We scale NWChem simulations with portable and scalable MPI asynchronous progress.
- We scale water molecule problems at ~100% parallel and computational efficiency on ~12,288 cores and reduce execution time by ~45%.

[Figure: core utilization (%) and task processing (%), COMM versus COMP, for NWChem CCSD(T) with the original MPI (no ASYNC) and with Casper: only 50% computational efficiency without ASYNC, close to 100% computational efficiency with Casper.]