



Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics

Ivan S. Ufimtsev and Todd J. Martinez*

Department of Chemistry, Stanford University, Stanford, California 94305

    Received June 11, 2009

Abstract: We demonstrate that a video gaming machine containing two consumer graphical cards can outpace a state-of-the-art quad-core processor workstation by a factor of more than 180× in Hartree-Fock energy + gradient calculations. Such performance makes it possible to run large scale Hartree-Fock and Density Functional Theory calculations, which typically require hundreds of traditional processor cores, on a single workstation. Benchmark Born-Oppenheimer molecular dynamics simulations are performed on two molecular systems using the 3-21G basis set: a hydronium ion solvated by 30 waters (94 atoms, 405 basis functions) and an aspartic acid molecule solvated by 147 waters (457 atoms, 2014 basis functions). Our GPU implementation can perform 27 ps/day and 0.7 ps/day of ab initio molecular dynamics simulation on a single desktop computer for these systems.

    Introduction

The idea of using graphical hardware for general purpose computing goes back over a decade.1-3 Nevertheless, early attempts to use GPUs for scientific calculations were largely stymied by lack of programmability and low precision of the hardware. The recent introduction of the Compute Unified Device Architecture (CUDA) programming interface4 and hardware capable of performing double precision arithmetic operations by Nvidia significantly simplified GPU programming and has triggered an increasing number of publications in different fields, such as classical molecular dynamics5-9 and quantum chemistry.10-15 Our initial implementation of two-electron repulsion integral evaluation algorithms13 and the entire direct self-consistent field procedure on the GPU14,15 demonstrated the large potential of graphical hardware for quantum chemistry calculations. In this article, we continue exploring the use of GPUs for quantum chemistry, including the calculation of analytical gradients for self-consistent field wave functions and the implementation of ab initio Born-Oppenheimer (BO) molecular dynamics. We compare the GPU performance results to the GAMESS22 quantum chemistry package running on an Intel Core2 quad-core 2.66 GHz CPU workstation, a state-of-the-art desktop computing system. Our machine-to-machine, rather than GPU-to-CPU-core, comparison provides a realistic estimate of the actual speedup that can be obtained in real calculations. Comparison of the total time required to calculate energies and gradients on these two platforms shows that the GPU workstation is 2 orders of magnitude faster than the CPU workstation. This allows us to carry out ab initio molecular dynamics of large systems at more than a thousand MD steps per day on a desktop computer.

Analytical Energy Gradient Implementation. The general formula for nuclear gradients follows directly from the expression for the Hartree-Fock energy, including a term accounting for the basis set dependence on molecular geometry:16

\[
\nabla_A E_{\mathrm{HF}} = \sum_{\mu\nu} D_{\mu\nu}\,(\nabla_A H_{\mu\nu}) - \sum_{\mu\nu} W_{\mu\nu}\,(\nabla_A S_{\mu\nu}) + \sum_{\mu\nu\lambda\sigma}\Big( D_{\mu\nu} D_{\lambda\sigma} - \tfrac{1}{2} D_{\mu\lambda} D_{\nu\sigma} \Big)\,[\nabla_A(\mu\nu)|\lambda\sigma] \qquad (1)
\]

where A labels an atomic center, W is the energy-weighted density matrix, S is the overlap matrix, D is the density matrix, and [μν|λσ] are two-electron repulsion integrals over primitive basis functions (eqs 2 and 3).

*Corresponding author e-mail: [email protected].

J. Chem. Theory Comput. 2009, 5, 2619-2628. DOI: 10.1021/ct9003004. © 2009 American Chemical Society. Published on Web 08/25/2009.


In eqs 2 and 3, r⃗ ≡ {x, y, z} is the electronic coordinate, R⃗_μ ≡ {X_μ, Y_μ, Z_μ} is the position of the atomic center associated with the μth basis function, and α_μ determines the width of this basis function. The total angular momentum of μ is given by the sum of the three integer parameters, l_μ = n_x + n_y + n_z, and is equal to 0, 1, 2 for s-, p-, d-type functions, etc. Our program currently supports only s- and p-functions, although we are working on an implementation of higher orbital momentum functions. The first two sums in eq 1 require little work compared to the last one and are calculated on the CPU in our implementation. All calculations on the CPU are carried out in full double precision, while calculations on the GPU are carried out in single precision unless otherwise indicated. The last sum, which is evaluated on the GPU, combines Coulomb and exchange contributions and runs over only those μν pairs where at least one of μ or ν is centered on atom A. As in our implementation of direct SCF,14,15 we treat the Coulomb and exchange terms separately, generating all required two-electron integrals and integral gradients from scratch using the McMurchie-Davidson algorithm17 (eq 4),

where Λ_p (Λ_q) is a Hermite Gaussian product centered on R⃗⁺_μν (R⃗⁺_λσ), and E_p^μν (E_q^λσ) are the expansion coefficients of the Cartesian Gaussian product μν (λσ) over the Hermite functions, i.e. eq 5.

The nuclear gradients ∇_A and ∇_B can now be represented in the new variables R⃗⁺_μν and R⃗⁻_μν (eqs 6-9), where A and B label the atomic centers of the functions μ and ν, respectively. In the following, we will use the notation [p|q] as a shorthand to indicate the integral over Hermite functions in eq 4, i.e. [Λ_p|Λ_q]. Thus, the indices p and q are pair indices corresponding to μν or λσ, respectively. This notation follows that introduced earlier in a comprehensive article by Gill on two-electron integral generation.18

A. Coulomb Contribution. The Coulomb contribution to ∇_A E is generally given by eq 10.

Following established practice, we expand the Cartesian Gaussian primitive pair products over Hermite Gaussian basis functions and preprocess the density matrix elements accordingly (eqs 11 and 12).19,20

Substituting eq 8 into eq 11 leads to the final result for ∇_A^Coul E_HF, eq 13.

In the cases under consideration here (s and p basis functions and no special treatment of sp-superblocks), there is only one pair of indices μν for a given p, and we indicate this with the notation μ(p) in eq 13. This dependence is suppressed for notational convenience in the following. We calculate J_p and J_p^∇ (eqs 14 and 15) on the GPU, while the preprocessing of eq 12 and the postprocessing of eq 13 are carried out on the CPU.
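The division of labor in eqs 12-15 can be illustrated with a short sketch. The array names (E_p, D_munu, grad_pq, etc.) and the dense [p|q] integral matrix are illustrative assumptions rather than the actual TeraChem data structures; the eq 14 and 15 contractions, which the paper performs in GPU kernels, are emulated here with NumPy.

```python
import numpy as np

# Hypothetical, minimal emulation of eqs 12-15 for a set of Hermite pair
# quantities (PQs). Shapes and names are assumptions for illustration only.
n_p, n_q = 6, 8                      # number of bra and ket PQs
rng = np.random.default_rng(0)

E_p    = rng.random(n_p)             # E_p^{mu nu} expansion coefficients (s-type: one per pair)
D_munu = rng.random(n_p)             # density matrix elements D_{mu nu} for each bra pair
E_q    = rng.random(n_q)             # E_q^{lambda sigma} coefficients
D_lams = rng.random(n_q)             # density matrix elements D_{lambda sigma} for each ket pair

# Eq 12 (CPU preprocessing): density-weighted Hermite expansion coefficients.
D_p = E_p * D_munu
D_q = E_q * D_lams

# Primitive Hermite integrals [p|q] and their gradients [nabla_r p|q]
# (normally produced on the GPU by the McMurchie-Davidson kernels).
pq      = rng.random((n_p, n_q))     # [p|q]
grad_pq = rng.random((3, n_p, n_q))  # [nabla_r p|q], Cartesian components

# Eqs 14 and 15 (GPU part, emulated): row-wise contractions with D_q.
J_p      = pq @ D_q                                # J_p
J_p_grad = -np.einsum('xpq,q->xp', grad_pq, D_q)   # J_p^nabla

# Eq 13 (CPU postprocessing): assemble the Coulomb gradient contribution
# for the atomic center carrying mu, using assumed Gaussian exponents.
alpha_mu = rng.random(n_p) + 0.5
alpha_nu = rng.random(n_p) + 0.5
grad_Dp  = rng.random((3, n_p))      # nabla_{R-} D_p (from differentiated E coefficients)

coul_grad = (np.einsum('p,xp->x', alpha_mu / (alpha_mu + alpha_nu) * D_p, J_p_grad)
             + np.einsum('xp,p->x', grad_Dp, J_p))
print(coul_grad)
```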

Evaluation of J_p and J_p^∇ is performed in a way very similar to our GPU implementation of the Coulomb matrix formation algorithm.15 Here we use the so-called 1T1PI mapping scheme, where one GPU thread calculates one primitive integral (or a batch of integrals if higher than s angular momentum functions are involved). This has proved to be the best choice if all required quantities are calculated directly from primitive integrals.13 The fundamental data organization used for calculating J_p and J_p^∇ is represented in Figure 1, using the same notation as in our previous article.15 The left and upper triangles represent the index-symmetry pruned lists of the ([p|, [∇p|) and |q] pairwise quantities (PQs) with doubled off-diagonal terms. The PQs are first organized into three groups according to the total angular momentum of the pair products, ss, sp, or pp. Furthermore, the [p| and |q] lists within each angular momentum grouping are sorted according to their Schwartz upper bounds (eqs 16 and 17),

where we also use the maximum Cartesian density matrix elements among all angular momentum functions in a batch.

\[
[\mu\nu|\lambda\sigma] = \iint \mu(\vec r_1)\,\nu(\vec r_1)\,\frac{1}{r_{12}}\,\lambda(\vec r_2)\,\sigma(\vec r_2)\; d^3 r_1\, d^3 r_2 \qquad (2)
\]

\[
\mu(\vec r) = c_\mu\,(x - X_\mu)^{n_x}(y - Y_\mu)^{n_y}(z - Z_\mu)^{n_z}\,\exp\!\big(-\alpha_\mu (\vec r - \vec R_\mu)^2\big) \qquad (3)
\]

\[
[\mu\nu|\lambda\sigma] = \sum_{pq} E_p^{\mu\nu} E_q^{\lambda\sigma}\,[\Lambda_p|\Lambda_q] \qquad (4)
\]

\[
\mu(\vec r - \vec R_\mu)\,\nu(\vec r - \vec R_\nu) = \sum_p E_p^{\mu\nu}(\vec R^{-}_{\mu\nu})\,\Lambda_p(\vec r - \vec R^{+}_{\mu\nu}) \qquad (5)
\]

\[
\vec R^{+}_{\mu\nu} = \frac{\alpha_\mu \vec R_\mu + \alpha_\nu \vec R_\nu}{\alpha_\mu + \alpha_\nu} \qquad (6)
\]

\[
\vec R^{-}_{\mu\nu} = \vec R_\mu - \vec R_\nu \qquad (7)
\]

\[
\nabla_A = \frac{\alpha_\mu}{\alpha_\mu + \alpha_\nu}\,\nabla_{R^{+}} + \nabla_{R^{-}} \qquad (8)
\]

\[
\nabla_B = \nabla_{R^{+}} - \nabla_A \qquad (9)
\]

\[
\nabla_A^{\mathrm{Coul}} E_{\mathrm{HF}} = \sum_{\mu\nu\lambda\sigma} D_{\mu\nu} D_{\lambda\sigma}\,[\nabla_A(\mu\nu)|\lambda\sigma] \qquad (10)
\]

\[
\nabla_A^{\mathrm{Coul}} E_{\mathrm{HF}} = \nabla_A \Big( \sum_p D_p \sum_q D_q\,[\Lambda_p|\Lambda_q] \Big) \qquad (11)
\]

\[
D_p = \sum_{\mu\nu} E_p^{\mu\nu} D_{\mu\nu}, \qquad D_q = \sum_{\lambda\sigma} E_q^{\lambda\sigma} D_{\lambda\sigma} \qquad (12)
\]

\[
\nabla_A^{\mathrm{Coul}} E_{\mathrm{HF}} = \sum_p \frac{\alpha_{\mu(p)}}{\alpha_{\mu(p)} + \alpha_{\nu(p)}}\,J_p^{\nabla} D_p + \sum_p J_p\,\nabla_{R^{-}} D_p \qquad (13)
\]

\[
J_p = \sum_q D_q\,[p|q] \qquad (14)
\]

\[
J_p^{\nabla} = \sum_q D_q\,[\nabla_{R^{+}} p|q] = -\sum_q D_q\,[\nabla_r\, p|q] \qquad (15)
\]

\[
[p|_{\mathrm{Schwartz}} = |D_{\mu\nu}|_{\max}\,[p|p]^{1/2} \qquad (16)
\]

\[
|q]_{\mathrm{Schwartz}} = |D_{\lambda\sigma}|_{\max}\,[q|q]^{1/2} \qquad (17)
\]


Following the usual practice,19 we calculate the Schwartz upper bounds in eqs 16 and 17 assuming p and q have zero angular momentum. The grouping and sorting of the lists of PQs, as well as the calculation of the relevant quantities, are carried out on the CPU. When combined, the ([p|, [∇p|) and |q] quantities lead to the ([p|q], [∇p|q]) integrals required by eqs 14 and 15. These are depicted by blue-green bordered squares in Figure 1, with one square denoting all ([p|q], [∇p|q]) integrals computed simultaneously by one GPU thread. For example, there are 4N_q such integrals ([p|q] plus the three Cartesian components of [∇p|q]) if both μ and ν functions have zero angular momentum, and the |q] batch has N_q PQs (N_q = 1, 4, 10 for ss-, sp-, pp-type batches, respectively). Because of the grouping by angular momentum, the resulting [p|q] integral grid consists of nine rectangular segments of different integral types ([ss|ss], [ss|sp], and so on, up to [pp|pp]). Each of these integral grids is handled by a different GPU kernel, optimized for the specific angular momentum. For simplicity, only one such grid segment is presented in Figure 1. The pink-blue coloration in Figure 1 represents the density-weighted magnitude of the Schwartz bounds for the pair-quantities and the resulting [p|q] integrals.

It can easily be seen from eqs 14 and 15 that J_p and J_p^∇ can be computed by summing all integral contributions in one row p. Because we use the 1T1PI mapping, each blue-green bordered square in Figure 1 is mapped to a GPU thread, depicted as a small orange square. For simplicity, four such GPU threads are depicted, organized into a 2 × 2 thread block. In practice, we use 8 × 8 thread blocks in the implementation. One such block then processes two rows of the integral matrix in column-by-column fashion, accumulating partial results in the corresponding GPU threads with double precision accuracy. Thus, the integrals and integral derivatives are computed in single precision, but the accumulation is done in double precision (on the GPU). As shown previously,15 this procedure avoids unnecessary precision loss with minimal cost. After the block reaches integrals with a Schwartz upper bound smaller than 10^-11 au, the scan is aborted, and a subsequent intrablock row-wise sum reduction leads to two J_p and J_p^∇ elements. The PQ presorting step guarantees that integrals omitted during the scan are even smaller than 10^-11 au and thus can be safely disregarded. Because J_p is exactly the same quantity calculated in our previously described Coulomb matrix formation algorithm,15 while J_p^∇ merely adds some additional terms to be calculated, we simply modified the J-matrix GPU kernels to incorporate these additional terms.
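The row scan with early abort and double precision accumulation can be emulated serially. The sketch below is not the CUDA implementation; it assumes sorted bound arrays and a hypothetical compute_pq callback standing in for the GPU integral kernel, and simply mimics the single precision integral / double precision accumulator combination.

```python
import numpy as np

THRESH = 1e-11  # Schwartz cutoff in atomic units

def scan_row(p, bra_bound, ket_bounds, ket_density, compute_pq):
    """Emulate one GPU thread row: accumulate J_p in double precision,
    aborting once the sorted Schwartz estimates fall below THRESH."""
    acc = np.float64(0.0)                        # double precision accumulator
    for q, bq in enumerate(ket_bounds):          # ket PQs sorted by bound
        if bra_bound * bq < THRESH:              # everything beyond is even smaller
            break
        integral = np.float32(compute_pq(p, q))  # single precision integral (as on the GPU)
        acc += np.float64(integral) * ket_density[q]
    return acc

# Toy usage with a decaying fake integral so the early exit actually triggers.
ket_bounds = np.sort(np.random.default_rng(2).random(1000))[::-1] * 1e-6
ket_density = np.ones_like(ket_bounds)
J_p = scan_row(0, 1e-4, ket_bounds, ket_density, lambda p, q: ket_bounds[q])
print(J_p)
```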

B. Exchange Contribution. The exchange contribution (eq 18) does not allow easy splitting of the work into the D_p[p| and |q]D_q product representation and thus requires a different data organization. Therefore, we generate all required PQs and integrals from scratch, without reusing any data from the Coulomb step. In addition, we do not preprocess the density matrix elements on the CPU. Instead, the density matrix is preprocessed on the fly for each integral (eq 20).

Figure 1. J_p and J_p^∇ calculation algorithm. Each blue-green-bordered square represents four sets of primitive Hermite integrals ([p|q] and [∇p|q]), which need to be contracted with density matrix elements along a p-row in order to obtain J_p and J_p^∇. The upper and left triangles denote the bra- and ket-pair quantities, sorted according to their Schwartz upper bound (represented by the pink-blue coloration). A GPU 2 × 2 thread block is delineated by a large orange-bordered square comprised of four smaller yellow-bordered squares (GPU threads). The yellow arrows labeled "memory loads" represent elements of the [p| and |q] linear arrays loaded into local memory by the corresponding GPU threads. The blue arrows show the direction along which the block scans the two neighboring rows, leading to two {J_p, J_p^∇} pairs of elements. The scan is aborted once each thread in a block encounters a [p|q] integral whose Schwartz upper bound is smaller than 10^-11 au.

\[
\nabla_A^{\mathrm{Exch}} E_{\mathrm{HF}} = -\frac{1}{2} \sum_{\mu\nu\lambda\sigma} D_{\mu\lambda} D_{\nu\sigma}\,[\nabla_A(\mu\nu)|\lambda\sigma] \qquad (18)
\]


Substituting eq 8 into eq 19 leads to the final result for the exchange contribution to the energy gradient, eq 21. Unlike in the Coulomb energy gradient calculation, here only the [μν| pair list is pruned according to the μ ↔ ν index symmetry (doubling off-diagonal pairs as appropriate), while the |λσ] list contains all O(N²) pairs, where N is the total number of primitive Gaussian-type basis functions. This organization is dictated by the need to carry out prescreening of small integrals in a way that is commensurate with memory access and load balancing requirements on the GPU. Because the |λσ] list is not pruned by index-symmetry, there are four rather than three different angular momentum type pairs: ss, sp, ps, and pp. Thus, the final integral grid contains twelve segments of different integral types (and requires twelve GPU kernels to handle them). Figure 2 provides more details on the GPU implementation of the exchange contribution to energy gradients. For simplicity, only one segment, e.g. ss|ss, is represented. Other segments are treated in the same way. We group [μν| and |λσ] pairs according to the first index (μ or λ, respectively). This procedure leads to N blocks, each with fixed [μ_i...| and |λ_i...]. Because the |λσ] list is not index-symmetry pruned, all |λ_i...] blocks contain N Gaussian pairs. The number of pairs in the [μ_i...| blocks varies (with a maximum of N) because the [μν| list is index-symmetry pruned. All [μ_i...| and |λ_i...] blocks are sorted according to the Schwartz upper bound of the corresponding [μ_i ν| and |λ_i σ] pairs. Note that no density matrix information is used in this sorting step. These presorted PQs are delineated by left and upper triangles in Figure 2, where the pink-blue coloration represents the Schwartz upper bound magnitude (as in Figure 1). After the PQs are sorted, K_p and K_p^∇ are calculated on the GPU by a series of twelve subsequent GPU kernel calls (one for each angular momentum segment).

Figure 2 provides details on the GPU implementation. Here, a 2 × 2 GPU thread block (we use 8 × 8 blocks in our program) is tasked to calculate two K_p and K_p^∇ quantities by scanning two nearby rows of the integral grid and accumulating (in double precision, as in the Coulomb algorithm) the partial results in GPU registers. During the scan, each thread monitors the product of the Schwartz integral upper bound and the maximum Cartesian density matrix element among all angular momentum functions in a batch, i.e. eq 23. When this product becomes smaller than 10^-11 au, a GPU thread aborts processing of the |λ_i...] row and resumes from segment |λ_{i+1}...]. The fact that [p| and |q] pairs are organized into [μ_j...| and |λ_i...] blocks guarantees that for all threads in a GPU block the maximum density matrix element |D_{μ_j λ_i}|_max is the same. Thus, all threads in a GPU block will abort the scan of the |λ_i...] segment simultaneously, avoiding potential problems due to load imbalance. After the scan is complete, the final result is obtained by intrablock sum reduction.

C. Multi-GPU Parallelization. The energy gradient code is parallelized using POSIX threads in order to use all available GPUs in a workstation. For both Coulomb and exchange contributions, the work is split into 8-row segments and mapped to the devices cyclically. Each GPU thus computes its portion of J_p, J_p^∇, K_p, and K_p^∇ and sends the results back to the host CPU, where they are postprocessed and the final energy gradient is calculated. This model also seems well suited for implementation on a multiple-GPU-node cluster (using the MPI framework,21 for example) since the work is distributed in such a way that each node has all the data required to calculate its portion of the integrals from the very beginning, avoiding expensive internode communication. Preliminary implementations support this conjecture, and work along these lines is in progress.
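A sketch of the cyclic eight-row work distribution, using Python threads in place of the POSIX threads that drive the GPUs in the actual code; process_rows is a hypothetical stand-in for launching the J/K kernels on one device.

```python
import threading
import numpy as np

N_ROWS, CHUNK, N_GPUS = 100, 8, 4          # 4 GPU processors (two GeForce 295GTX cards)
partial = [np.zeros(N_ROWS) for _ in range(N_GPUS)]

def process_rows(row_block):
    """Stand-in for a GPU kernel launch: returns one result per row in the block."""
    return np.array([float(r) for r in row_block])

def worker(gpu_id):
    # Device gpu_id takes every N_GPUS-th 8-row chunk (cyclic mapping).
    for start in range(gpu_id * CHUNK, N_ROWS, CHUNK * N_GPUS):
        rows = range(start, min(start + CHUNK, N_ROWS))
        partial[gpu_id][list(rows)] = process_rows(rows)

threads = [threading.Thread(target=worker, args=(g,)) for g in range(N_GPUS)]
for t in threads: t.start()
for t in threads: t.join()

# The host CPU combines the per-device results before the eq 13 / eq 21 postprocessing.
J_rows = np.sum(partial, axis=0)
print(J_rows[:10])
```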

A series of tests performed on a system with two GeForce 295GTX cards, each having two GPU processors, demonstrated a reasonable speedup of 3.0-3.5× relative to a single GPU processor. Reported timings include the time required to calculate and sort the pair-quantities, which is currently performed on a single CPU core, as well as the data transfer time required to copy the PQs to the GPU and then copy the results back to the CPU.

    Results and Discussion

To assess the performance of the GPU code, we carried out a series of benchmarks on a representative set of test molecules and compared the results to GAMESS22 ver. 11 Apr 2008 (R1). The GAMESS code was executed on an Intel Core2 quad-core 2.66 GHz CPU with 8 GB main memory, which represents a state-of-the-art desktop computing system. All four CPU cores were used in parallel in order to obtain maximum CPU performance. The GAMESS program was compiled with the GNU Fortran compiler and linked with Intel MKL ver. 10.0.3. Our GPU code ran on the same workstation with two Nvidia GeForce 295GTX cards operating in parallel. All performance results in this article correspond to this workstation-to-workstation comparison rather than a single GPU to a single CPU core comparison. This provides a realistic assessment of the real performance gain one can obtain from a GPU system. For brevity, the quad-core CPU machine and the dual-GPU machine are referred to as CPU and GPU, respectively.

Table 1 presents the time required to calculate the Hartree-Fock energy gradient vector for caffeine (C8N4H10O2), cholesterol (C27H46O), buckyball (C60), taxol (C45NH49O15), valinomycin (C54N6H90O18), and olestra (C156H278O19) molecules using the 3-21G basis set. Among these test systems, the largest has 2131 basis functions. One can see that even for a small molecule such as caffeine, the GPU outperforms the CPU by a factor of 6.

\[
\nabla_A^{\mathrm{Exch}} E_{\mathrm{HF}} = -\frac{1}{2} \nabla_A \Big( \sum_p E_p^{\mu\nu} \sum_q D_{pq}\,[\Lambda_p|\Lambda_q] \Big) \qquad (19)
\]

\[
D_{pq} = E_q^{\lambda\sigma} D_{\mu\lambda} D_{\nu\sigma} \qquad (20)
\]

\[
\nabla_A^{\mathrm{Exch}} E_{\mathrm{HF}} = -\frac{1}{2} \Big( \sum_p \frac{\alpha_\mu}{\alpha_\mu + \alpha_\nu}\,K_p^{\nabla} E_p^{\mu\nu} + \sum_p K_p\,\nabla_{R^{-}} E_p^{\mu\nu} \Big) \qquad (21)
\]

\[
K_p = \sum_q D_{pq}\,[p|q], \qquad K_p^{\nabla} = -\sum_q D_{pq}\,[\nabla_r\, p|q] \qquad (22)
\]

\[
([p|p]\,[q|q])^{1/2}\,|D_{\lambda\sigma}|_{\max} \qquad (23)
\]


For medium-size molecules the speedup ranges between 20× and 25×, while for large molecules it exceeds 100×. In addition, in Table 2 we present the total program execution time, energy + gradient, for the same set of molecules. In both CPU and GPU calculations the total number of SCF iterations required to converge the wave function was exactly the same, although it differed among the molecules. The resulting speedups range from 4× for small, 30-40× for medium, and up to almost 200× for the largest molecules. Clearly, these results represent a mixture of different coding styles, compiler efficiencies, and hardware architectures. However, it is also obvious that such speedups would never be achievable without the impressive performance gain provided by the GPU, which enables desktop calculation of ab initio geometry optimization and molecular dynamics (MD) simulations that were previously only possible on computing clusters with more than a hundred CPUs.

The increased performance of the GPU code clearly improves the quality of molecular dynamics simulation results by allowing longer runs and thus better statistics. However, accuracy is another aspect that needs to be considered, especially when part of the energy and atomic force calculation is performed in single precision. To assess the error introduced by the use of single precision in integral evaluation, we performed three independent tests.

First, we directly calculated the root mean squared error for all components of the corresponding single precision (GPU) and double precision (CPU) gradient vectors (eq 24) for all the benchmark molecules. All molecular geometries, along with the corresponding energies and gradients calculated on the CPU and GPU, are provided in the Supporting Information. The results, presented in Table 1, demonstrate that the mean error is distributed around 10^-5 au, which is close to the typical convergence thresholds used in geometry optimization algorithms. It is also important that there is no obvious correlation between the mean error and the size of a molecule, i.e. the error does not increase with the number of atoms.
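Equation 24 amounts to a root mean square over all 3N_atoms Cartesian force components. A minimal sketch (with random stand-in gradients perturbed at the ~10^-5 au scale reported in Table 1):

```python
import numpy as np

def rms_gradient_error(f_cpu, f_gpu):
    """Eq 24: RMS deviation between CPU (double) and GPU (mixed) gradients.
    Both arrays have shape (N_atoms, 3), in atomic units."""
    diff = np.asarray(f_cpu, dtype=np.float64) - np.asarray(f_gpu, dtype=np.float64)
    return np.sqrt(np.sum(diff**2) / diff.size)   # diff.size == 3 * N_atoms

# Illustrative use with placeholder data (94 atoms, as in the hydronium cluster).
rng = np.random.default_rng(4)
f_cpu = rng.normal(size=(94, 3))
f_gpu = f_cpu + rng.normal(scale=1e-5, size=f_cpu.shape)
print(f"RMS error = {rms_gradient_error(f_cpu, f_gpu):.2e} au")
```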

Figure 2. K_p and K_p^∇ calculation algorithm. Similar to the J calculation shown in Figure 1, but in this case all ket-pairs are grouped into N segments according to the λ_i index. Two such segments are represented. The GPU thread block scans all the segments sequentially. Once integrals which make contributions smaller than 10^-11 au are reached, the scan of the particular segment is aborted, and the block proceeds from the next segment.

Table 1. Analytical Energy Gradient Accuracy and Calculation Time for Different Test Molecules Using the 3-21G Basis Set

  molecule                     caffeine   cholesterol   C60     taxol   valinomycin   olestra
  CPU, sec (a)                 0.7        7.6           38.5    28.9    66.3          785.1
  GPU, sec (b)                 0.1        0.4           1.9     1.4     2.6           7.3
  speedup                      5.8        19            20      21      26            108
  rms error / 10^-5 au (c)     1.28       0.61          2.79    1.49    1.24          0.67

(a) GAMESS on an Intel Core2 quad-core 2.66 GHz CPU. (b) Two Nvidia GeForce 295GTX cards. (c) Error in the gradient (atomic units) as defined by eq 24.

Table 2. Total Energy and Gradient Computation Time Using the 3-21G Basis Set (a)

  molecule       caffeine   cholesterol   buckyball   taxol   valinomycin   olestra
  CPU, sec (b)   7.4        81.8          364         435     1112          22863
  GPU, sec (b)   2.0        4.8           11.6        15.7    25.5          125
  iterations     13         11            10          14      13            14
  speedup        3.7        17            31          28      44            182

(a) The number of SCF iterations performed to converge the wave function was exactly the same for GPU and CPU. (b) CPU and GPU are the same machines as in Table 1.

\[
\mathrm{RMS}_{\mathrm{error}} = \left[ \sum_{i=1}^{3N_{\mathrm{atoms}}} \big( f_i^{\mathrm{CPU}} - f_i^{\mathrm{GPU}} \big)^2 \Big/ 3N_{\mathrm{atoms}} \right]^{1/2} \qquad (24)
\]


Second, we carried out geometry optimization of a helical hepta-alanine, i.e. (ALA)7, peptide on the CPU and GPU using the 3-21G basis set. The corresponding SCF energy evolution and resulting structures from GAMESS and our GPU code are compared in Figure 3. In both cases, the same initial geometry and energy minimization algorithm was used (trust region method with BFGS Hessian update), although our implementation was written from scratch and therefore the codes are not necessarily identical. However, we did employ exactly the same trust radius update protocol as implemented in GAMESS. It can easily be seen that the curves follow each other very closely throughout the whole optimization procedure, slightly diverging at the very end. This is a good indicator that the GPU can be efficiently used for solving molecular geometry optimization problems. The final energy discrepancy is 0.50 kcal/mol, of which 0.42 kcal/mol is due to different atomic configurations and the rest is due to the lower accuracy of single point energy calculations on the GPU. In addition, the optimized structures overlap almost perfectly, as shown in the inset of Figure 3, where the CPU-optimized and GPU-optimized structures are portrayed in orange and multicolored representations, respectively.
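For reference, the BFGS Hessian update used in quasi-Newton optimizers of this kind has the standard textbook form sketched below. This is a generic sketch on a toy quadratic surface, not the GAMESS or TeraChem trust-region code, and the trust-radius protocol is reduced to a simple step-length cap.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the approximate Hessian B from the
    displacement s = x_new - x_old and gradient change y = g_new - g_old."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

def trust_region_step(B, g, trust_radius):
    """Quasi-Newton step -B^-1 g, scaled back onto the trust-region sphere."""
    step = -np.linalg.solve(B, g)
    norm = np.linalg.norm(step)
    return step if norm <= trust_radius else step * (trust_radius / norm)

# Toy quadratic surface: minimize 0.5 x^T A x - b^T x.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x, B = np.zeros(2), np.eye(2)
g = A @ x - b
for _ in range(20):
    s = trust_region_step(B, g, trust_radius=0.3)
    x_new = x + s
    g_new = A @ x_new - b
    B = bfgs_update(B, s, g_new - g)
    x, g = x_new, g_new
print(x, np.linalg.solve(A, b))   # approaches the analytic minimum
```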

Finally, conservation of the total energy is a common metric for assessing the energy gradient accuracy in a dynamics algorithm. Therefore, we performed a time-reversible23 Hartree-Fock Born-Oppenheimer molecular dynamics simulation of an H3O+(H2O)30 cluster using the 6-31G basis set and the microcanonical ensemble. The Newtonian equations of motion were integrated using the velocity Verlet algorithm with a 0.5 fs time step for a total simulation time of 20 ps. Figure 4 shows the resulting time evolution of the kinetic (red), potential (blue), and total (green) energies. Energy is conserved quite well, with a small 0.022 kcal/mol ps^-1 total energy drift. Considering, for example, an upper limit of 5% of the initial kinetic energy as a maximum acceptable total energy drift over the entire simulation, this system can be simulated for 200 ps. Even longer time scales can be accessed if one introduces Langevin thermostats and careful adjustment of the damping parameter,24 although care needs to be taken in this case to ensure that the dynamics is not significantly modified by the damping parameter. To further demonstrate that GPU-based ab initio molecular dynamics can treat important phenomena of widespread chemical interest such as proton transfer, we performed AIMD simulations of two systems: an H3O+(H2O)30 cluster and a neutral aspartic acid molecule solvated by 147 water molecules.
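A generic velocity Verlet propagator of the kind used here is sketched below; the forces callback is a placeholder for the GPU Hartree-Fock gradient call, the toy harmonic example is purely illustrative, and the drift-budget comment restates the 5% / 0.022 kcal/mol·ps estimate from the text under the assumption of ~3/2 NkT of kinetic energy for 94 atoms near 300 K.

```python
import numpy as np

def velocity_verlet(x, v, masses, forces, dt, n_steps):
    """Propagate Newton's equations with velocity Verlet.
    `forces(x)` stands in for the GPU Born-Oppenheimer gradient evaluation."""
    f = forces(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / masses[:, None]
        x = x + dt * v_half
        f = forces(x)                      # one SCF energy + gradient per MD step
        v = v_half + 0.5 * dt * f / masses[:, None]
    return x, v

# Drift budget quoted in the text: with roughly 3/2 N kT ~ 84 kcal/mol of kinetic
# energy for 94 atoms near 300 K, a 5% tolerance and a 0.022 kcal/mol/ps drift
# allow about 0.05 * 84 / 0.022 ~ 190-200 ps of simulation.

# Toy usage: a single particle in a harmonic well.
x = np.array([[1.0, 0.0, 0.0]])
v = np.zeros_like(x)
m = np.array([1.0])
x, v = velocity_verlet(x, v, m, lambda r: -r, dt=0.05, n_steps=1000)
print(x, v)
```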

A. Protonated Water Cluster. A 20-ps long NVT (T = 300 K) molecular dynamics simulation was performed on the H3O+(H2O)30 cluster at the RHF/6-31G, BLYP/6-31G, and B3LYP/6-31G levels using a 0.5 fs integration time step. The density functional theory calculations (energy and nuclear energy gradient), which run entirely on the GPU and use all available GPU processors in parallel, were recently incorporated into TeraChem, the general purpose GPU-based quantum chemistry package being developed in our group. All three simulations started from an initial cluster geometry where the hydronium ion was located on the cluster surface.

Figure 5 shows the time evolution of the distance between the hydronium ion and the cluster center, along with the mean cluster radii represented by dashed lines. The cluster radius was defined as the maximum distance between any of the O-atoms and the cluster center. In all three cases, the ion stayed close to the surface throughout the entire simulation, as previously reported.25,26 Somewhat larger oscillations of the distance in the RHF

Figure 3. Evolution of the (ALA)7 helix SCF energy during geometry optimization on the CPU (blue) and GPU (red) using the 3-21G basis set. Both curves follow each other very closely, demonstrating that the GPU accuracy is sufficient for solving geometry optimization problems. The final energy difference is 0.50 kcal/mol, of which 0.08 kcal/mol is due to the lower accuracy of single point energy calculations on the GPU and 0.42 kcal/mol is due to different atomic positions. The inset portrays the two optimized structures overlaid on top of each other (orange: CPU, multicolored: GPU).

Figure 4. The SCF (blue), kinetic (red), and total (green) energies of the H3O+(H2O)30 cluster during a microcanonical (301 K) Born-Oppenheimer Hartree-Fock molecular dynamics simulation using the 3-21G basis set on two Nvidia GeForce 295GTX GPUs. Two-electron integrals and their derivatives are calculated on the GPU in single precision, and their contributions are accumulated in double precision. The total energy drift is 0.022 kcal/mol ps^-1, which corresponds to a 0.039 K ps^-1 averaged temperature drift.


simulation, compared to the DFT results, show that at the HF level of theory the structure of the cluster surface is subject to higher fluctuations. In addition, RHF and BLYP provide similar structures of the first H3O+ solvation shell, as can be seen from the oxygen-oxygen radial distribution functions presented in Figure 6 (red and blue lines, respectively). In both cases, the first peak is centered at 2.50 Å, which is smaller than the same value for bulk water (2.8 Å) and is an expected signature of stronger H-bonds. The B3LYP simulation (green line in Figure 6) predicts a somewhat more diffuse first solvation shell, with the first peak centered at 2.46 Å. In addition, the B3LYP RDF reveals a stronger bimodal character around the maximum due to continuous interplay between Eigen and Zundel structures. In all three simulations there was an average of three water molecules in the first solvation shell of the hydronium ion.

B. Solvated Aspartic Acid. We also performed a 7-ps simulation of a single neutral aspartic acid molecule solvated by a cluster of 147 water molecules using the 3-21G basis set. The total number of atoms and basis functions in this system was 457 and 2014, respectively. The C4NH7O4 molecule was first solvated in a cube of water and equilibrated for 100 ps using the CHARMM force field27 and periodic boundary conditions. Then, a 10 Å sphere was sketched around the Cα atom, and those waters whose O-atoms were located inside this sphere (147 molecules) were selected for further modeling with AIMD. The resulting system was equilibrated for 1 ps using AIMD. In both classical and ab initio equilibration MD simulations, a 0.5 fs integration time step and the Langevin thermostat28 (T = 300 K, damping time 1 ps) were used, and the atoms in the aspartic acid molecule were fixed in order to prevent deprotonation of the carboxylic groups. Finally, a 7-ps production AIMD run was performed on the system in the NVT ensemble using a 0.5 fs integration time step. Figure 7 portrays a snapshot of the system (left panel) along with its highest occupied molecular orbital (HOMO, right panel). The orbitals were calculated by the GPU-accelerated Orbital plugin implemented in VMD.29

The aspartic acid molecule has two carboxylic acid functional groups, backbone and side chain, referred to as COOHbb and COOHsc in the following. Because both of the groups have low pKa, one might expect them to deprotonate quickly in aqueous solution. The AIMD simulation mostly confirms these expectations. Figure 8 displays the time evolution of the O-H distance for both COOH groups. One of these (COOHsc, blue line in Figure 8) quickly deprotonates in approximately 400 fs through formation of a short-lived (~50 fs) transient Zundel-like structure. The resulting hydronium ion then quickly (within ~50 fs) shuttles to the surface of the cluster via two subsequent proton transfer events and stays at the surface for the rest of the simulation. The second deprotonation event (of the COOHbb group) does not occur until significantly later (t ≈ 5.5 ps) in the simulation, presumably because of the high proton affinity of the resulting doubly negatively charged amino acid ion. An aborted attempt at deprotonation of COOHbb by forming a quasi-stable COO-·H3O+ complex is observed after ~3.1 ps of simulation (green line in Figure 8). The presence of positive counterions near the molecule (not accounted for in our simulation) would be expected to facilitate faster deprotonation of all carboxylic groups. The carboxyl-water Ocarboxyl-Ow (including both oxygen atoms of each of the COOHsc and COOHbb groups) and amino-water Namino-Ow radial distribution functions, which are proportional to the local water density around these groups, are presented in Figure 9 and provide details on the solvent structure around these functional groups. The Ocarboxyl-Ow RDF (red line in Figure 9) has its first maximum centered at 2.69 Å, and integration to the first minimum of the RDF reveals that on average 2.5 water molecules are present in the first solvation shell of each carboxyl oxygen.
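The first-shell coordination numbers quoted here correspond to integrating the RDF out to its first minimum, which for a finite cluster is equivalent to counting neighbors within that radius. A sketch with placeholder trajectory arrays; the cutoff radius of 3.4 Å is an assumed stand-in for the first RDF minimum, which is not quoted explicitly in the text.

```python
import numpy as np

def mean_coordination(traj_solute_O, traj_water_O, r_cut):
    """Average number of water oxygens within r_cut (Angstrom) of each solute
    oxygen, averaged over frames and solute atoms.
    traj_solute_O: (n_frames, n_solute_O, 3); traj_water_O: (n_frames, n_water, 3)."""
    counts = []
    for solute, water in zip(traj_solute_O, traj_water_O):
        d = np.linalg.norm(solute[:, None, :] - water[None, :, :], axis=-1)
        counts.append((d < r_cut).sum(axis=1))     # neighbors per solute oxygen
    return float(np.mean(counts))

# Placeholder trajectory: 5 frames, 4 carboxyl oxygens, 147 water oxygens.
rng = np.random.default_rng(5)
solute = rng.uniform(-2, 2, size=(5, 4, 3))
water = rng.uniform(-10, 10, size=(5, 147, 3))
print(mean_coordination(solute, water, r_cut=3.4))   # r_cut: assumed first RDF minimum
```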

Figure 5. Time evolution of the H3O+ ion distance from the cluster center. Each point on the plot is averaged over 100 successive MD steps. The dashed lines represent the cluster radius, averaged over the whole MD simulation run. The radius is defined as the maximum distance between an O-atom and the center of the cluster.

Figure 6. The hydronium-water oxygen-oxygen radial distribution function for the H3O+(H2O)30 cluster, using various levels of theory and the 6-31G basis set.


In contrast, the Namino-Ow RDF (blue line in Figure 9) indicates a rather hydrophobic character of the amino group, although the group periodically accepts and donates weak H-bonds. In many cases, however, the solvent water forms a hydrophobic cage around NH2, where waters prefer to donate/accept H-bonds to/from neighboring waters. Those water molecules that do sometimes accept an H-bond from the amino group in most cases become 5-coordinated. Because such a water solvation structure is less energetically favorable than the 4-coordinated tetrahedral configuration, these water molecules tend to break such H-bonds. Quantitative analysis of the trajectory demonstrates that the amino group donates 0, 1, and 2 H-bonds for 54%, 45%, and 1% of the simulation time, respectively. Similarly, it accepts 0 and 1 H-bonds for 81% and 19% of the simulation time. An H-bond was defined using a 3.2 Å O-O distance cutoff and a 30° (O/N)-Ow-Hw angle cutoff. The character of the distribution of H-bonds donated by NH2 does not vary much during the simulation, which accesses three different charge states of the aspartic acid (0, -1, and -2). However, the amino group reveals stronger hydrophilic properties in the -2 charge state by stabilizing the accepted H-bond and reducing the mean N-Ow distance to the nearest water molecule from 3.1 Å (charge states 0 and -1) to 2.8 Å (charge state -2). Figure 10 shows the time evolution of the N-Ow distance, where the COOH group deprotonation events (for COOHsc and COOHbb) are marked by red vertical lines. The black dashed line denotes the mean N-Ow distance for all (0, -1, and -2) charge states of the aspartic acid molecule before and after the deprotonation of COOHbb. For both the neutral and singly negative charge states the distance oscillates around 3.1 Å, but it drops to 2.8 Å after COOHbb is deprotonated and the molecule becomes doubly negatively charged. The hydrophobic-like behavior of the amino group in the partially deprotonated aspartic acid molecule does not seem to preclude protonation. In a preliminary simulation (not shown), the NH2 residue is rapidly protonated if there is a hydronium ion in its first solvation shell. However, we leave detailed analysis of this for future, more extensive, studies.

    Conclusions

We have demonstrated that it is possible to achieve up to a 200× speedup in energy + gradient calculations by redesigning quantum chemistry algorithms for the GPU.

Figure 7. Left: snapshot of the AIMD simulation of the aspartic acid molecule solvated by 147 waters. Right: HOMO along with nearby water molecules among which the orbital is mostly delocalized. The isosurfaces correspond to ±0.01 au.

Figure 8. The O-H distance in the backbone (COOHbb, green) and side chain (COOHsc, blue) carboxylic acid functional groups. The dashed line represents the O-H distance corresponding to the COO-·H3O+ complex.

Figure 9. The aspartic acid oxygen (red)- and nitrogen (blue)-water oxygen radial distribution functions. The neutral amino group reveals prominent hydrophobic character.


The performance gain was assessed by comparing a state-of-the-art Intel Core2 quad-core 2.66 GHz CPU workstation with the same workstation containing two Nvidia GeForce 295GTX graphical cards. Both Hartree-Fock and density functional theory (including generalized gradient and hybrid functionals with exact exchange) electronic structure methods can be used in AIMD simulations with our implementation. The remarkable speedups attained by executing all computationally intensive parts of the code on the GPU rather than on the CPU make it possible to carry out ab initio molecular dynamics simulations of large systems containing more than 2000 basis functions at 1400 MD steps/day on a single desktop computer.

We show in an example H3O+(H2O)30 cluster MD simulation that the total energy drift due to limited hardware precision is minor, and 100 ps MD simulations can be performed with total energy conservation errors of less than a few percent of the initial kinetic energy. Furthermore, even this minor drift can be compensated by employing a Langevin thermostat and properly adjusting the damping parameter,24 meaning that even longer ab initio MD runs can be performed on the GPU.

We have presented results for preliminary AIMD simulations of proton transfer and transport in solvated clusters that were facilitated by the developments described here. Further study of these phenomena, collecting significant statistics, is underway.

Acknowledgment. This work was supported by the National Science Foundation (CHE-06-26354). TeraChem development was carried out at the University of Illinois. I.S.U. is an Nvidia fellow.

Supporting Information Available: Cartesian coordinates and SCF energies and gradients computed using the CPU (double precision) and GPU (mixed precision) for the molecules listed in Table 1. This material is available free of charge via the Internet at http://pubs.acs.org.

    References

(1) Lengyel, J.; Reichert, M.; Donald, B. R.; Greenberg, D. P. Comput. Graph. 1990, 24, 327.

(2) Bohn, C. A. Joint Conference on Intelligent Systems 1999 (JCIS98); 1998; Vol. 2, p 64.

(3) Hoff, K. E., III; Culver, T.; Keyser, J.; Ming, L.; Manocha, D. Computer Graphics Proceedings, SIGGRAPH 99; 1999; p 277.

(4) NVIDIA CUDA: Compute Unified Device Architecture Programming Guide, Version 2.2. http://www.nvidia.com/object/cuda_develop.html (accessed May 15, 2009).

(5) Stone, J. E.; Phillips, J. C.; Freddolino, P. L.; Hardy, D. J.; Trabuco, L. G.; Schulten, K. J. Comput. Chem. 2007, 28, 2618.

(6) Anderson, J. A.; Lorenz, C. D.; Travesset, A. J. Comput. Phys. 2008, 227, 5342.

(7) Liu, W. G.; Schmidt, B.; Voss, G.; Muller-Wittig, W. Comput. Phys. Commun. 2008, 179, 634.

(8) Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. J. Comput. Chem. 2009, 30, 864.

(9) Giupponi, G.; Harvey, M. J.; De Fabritiis, G. Drug Discovery Today 2008, 13, 1052.

(10) Yasuda, K. J. Comput. Chem. 2008, 29, 334.

(11) Yasuda, K. J. Chem. Theory Comput. 2008, 4, 1230.

(12) Vogt, L.; Olivares-Amaya, R.; Kermes, S.; Shao, Y.; Amador-Bedolla, C.; Aspuru-Guzik, A. J. Phys. Chem. A 2008, 112, 2049.

(13) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2008, 4, 222.

(14) Ufimtsev, I. S.; Martinez, T. J. Comput. Sci. Eng. 2008, 10, 26.

(15) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004.

(16) Pulay, P. Mol. Phys. 1969, 17, 197.

(17) McMurchie, L. E.; Davidson, E. R. J. Comput. Phys. 1978, 26, 218.

(18) Gill, P. M. W. Adv. Quantum Chem. 1994, 25, 141.

(19) Almlof, J.; Faegri, K.; Korsell, K. J. Comput. Chem. 1982, 3, 385.

(20) Ahmadi, G. R.; Almlof, J. Chem. Phys. Lett. 1995, 246, 364.

(21) Gropp, W.; Lusk, E.; Skjellum, A. Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed.; MIT Press: Cambridge, 1999.

(22) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. J. Comput. Chem. 1993, 14, 1347.

(23) Niklasson, A. M. N.; Tymczak, C. J.; Challacombe, M. Phys. Rev. Lett. 2006, 97, 4.

Figure 10. Time evolution of the N-Ow distance between the amino nitrogen and the nearest water molecule. Deprotonation events of the carboxylic acid groups are marked by red vertical lines. The horizontal dashed lines represent the mean N-Ow distance for all (0, -1, and -2) charge states of the aspartic acid molecule. The mean distance is essentially the same (3.1 Å) for the 0 and -1 charge states but suddenly drops to 2.8 Å as the backbone carboxylic acid group is deprotonated.


(24) Kuhne, T. D.; Krack, M.; Parrinello, M. J. Chem. Theory Comput. 2009, 5, 235.

(25) Buch, V.; Milet, A.; Vacha, R.; Jungwirth, P.; Devlin, J. P. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 7342.

(26) Petersen, M. K.; Iyengar, S. S.; Day, T. J. F.; Voth, G. A. J. Phys. Chem. B 2004, 108, 14804.

(27) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J. Phys. Chem. B 1998, 102, 3586.

(28) Schlick, T. Molecular Modeling and Simulation; Springer: New York, 2002.

(29) Humphrey, W.; Dalke, A.; Schulten, K. J. Mol. Graph. 1996, 14, 33.
