
Accelerating correlated quantum chemistry calculations using graphical processing units

Mark A. Watson,1 Roberto Olivares-Amaya,1 Richard G. Edgar,1 Tomás Arias,2 and Alán Aspuru-Guzik1,∗

1Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138.

2Laboratory of Atomic and Solid State Physics, Cornell University, Ithaca, New York 14853.

∗Electronic address: [email protected]

Graphical processing units are now being used with dramatic effect to accelerate quantum chemistry applications. The authors give a brief introduction to electronic structure methods and describe their efforts to accelerate a correlated quantum chemistry code. They propose and analyze two new tools for accelerating matrix multiplications where single-precision accuracy is insufficient.

    I. INTRODUCTION

With the advent of modern quantum theory a century ago, scientists quickly realized that quantum mechanics could offer a predictive theory of chemistry, revolutionizing the subject in the same way that Newton's laws had transformed the study of classical mechanics. Over the intervening decades, we have witnessed an exponential increase in available computing power. Coupled with tremendous advances in theory and algorithmic methods, as well as the painstaking development of sophisticated program suites often consisting of millions of lines of code, the scope of phenomena now amenable to the predictive techniques of quantum chemistry is large and continually growing.

Modern computational chemistry has an important role to play in many practical applications, such as the discovery of new drugs or industrial catalysts, and the development of new materials or technologies to meet global energy and environmental challenges. Its widespread application has resulted in major efforts to reduce its computational cost: accurate quantum chemistry methods consume a significant fraction of computing resources at national laboratories.

As a result, we are witnessing a new era in the optimization of quantum chemistry codes following an explosion of interest in the utilization of coprocessors such as graphical processing units (GPUs). This interest in GPUs and other accelerators is largely driven by their combination of formidable performance and relatively low cost. But another key reason for their emergence in scientific fields was the release of NVIDIA's CUDA (compute unified device architecture) toolkit, which dramatically simplified the process of developing code for GPUs.

Already, GPUs have started to be used by computational chemists to treat a wide range of problems. These include molecular dynamics and quantum Monte Carlo simulations, density-functional theory and self-consistent field calculations [1, 2], as well as correlated quantum chemistry applications [3, 4]. Efficiency gains of between one and three orders of magnitude have been reported compared to conventional implementations on a CPU. Thus new domains of scientific application are now possible where, previously, extremely expensive and rare supercomputing facilities would have been required.

A very important area for many scientific applications is the acceleration of linear algebra operations, which are quite well-suited to GPU architectures. Included in the CUDA software development toolkit is an implementation of the BLAS linear algebra library, named CUBLAS. By simply replacing the BLAS *GEMM routines with corresponding CUBLAS SGEMM calls to accelerate key matrix multiplications, our group [3] was able to achieve a speedup of 4.3x when calculating the RI-MP2 (resolution-of-the-identity second-order Møller-Plesset perturbation theory [5, 6]) correlation energy of docosane (C22H46).

This initial effort was one of the first quantum chemistry applications to leverage GPUs, but it revealed several issues for future work. For example, while modern GPU cards designed for research can have up to 4 GiB of RAM, consumer-level cards may have as little as 256 MiB. Without some means to overcome the memory bottleneck, our first attempts to use GPUs to accelerate RI-MP2 calculations were limited to systems with fewer than 600 basis functions.

Another issue is numerical precision. The vast majority of GPU cards currently in use worldwide support only single-precision arithmetic. Single precision is generally insufficient to achieve chemical accuracy of 1 kcal/mol in calculations on anything but the smallest and simplest systems, since the errors quickly accumulate for larger molecules. An interesting topic in the computer science community [7, 8] has been the development of libraries which achieve precision beyond the hardware specification through algorithmic techniques. In this work, we consider this problem in detail for the special case of single-precision GPU devices and applications to quantum chemistry.

In summary, we describe our efforts to develop tools for the GPU acceleration of correlated quantum chemistry calculations [3, 4]. We address the issues of limited GPU device memory, as well as achieving higher accuracy using only single-precision GPU operations. We begin in section II with an overview of basic quantum chemistry theory, followed by the algorithms behind our new libraries. Next we report the efficiency of our methods for general matrix multiplications and in the context of accelerating quantum chemistry calculations. We end the article with a brief conclusion and our future perspective.

    II. QUANTUM CHEMISTRY THEORY

Traditional quantum chemistry strives to solve the time-independent Schrödinger equation,

$$\hat{H}(\mathbf{r},\mathbf{R})\,\Psi(\mathbf{r},\mathbf{R}) = E(\mathbf{R})\,\Psi(\mathbf{r},\mathbf{R}) \qquad (1)$$

where Ψ(r,R) is the electronic wavefunction for a molecular system defined by the Hamiltonian Ĥ(r,R) in terms of a set of coordinates for the electrons, r, and nuclei, R. The total molecular energy is given by the eigenvalue E(R) and is often referred to as the potential energy surface. It is parameterized in terms of the nuclear coordinates since we have assumed the Born-Oppenheimer approximation, and is fundamental to the ab initio theory of chemical reaction dynamics.

The solution of eqn. (1) yields the electronic wavefunction for a given nuclear configuration, and in turn the probability density for finding an electron at a given point in space. Exact solution is intractable, even numerically, for all but the simplest systems due to the formidable nature of the many-body problem inherent in the form of Ĥ(r,R), which describes not only the electronic kinetic energy, but also the complex Coulomb interactions between the electrons and nuclei. Only for one-electron systems, where there is no electron-electron coupling, are exact solutions readily available.

Nevertheless, it is a hallmark of quantum chemistry that there is a well-defined hierarchy of methods for solving eqn. (1) approximately. Indeed, given sufficient computational power, a solution may be improved systematically to yield a wavefunction of arbitrary numerical accuracy. The simplest method in the hierarchy is a mean-field approach known as Hartree-Fock (HF) theory.

In HF theory, we write the electronic wavefunction for an N-electron system as an antisymmetrized product of molecular orbitals, ψ(r_i), which are wavefunctions for a single electron under a one-electron Hamiltonian known as the Fock operator,

$$\hat{f}(\mathbf{r},\mathbf{R}) = \hat{h}(\mathbf{r},\mathbf{R}) + \sum_{j=1}^{N/2}\left[2\hat{J}_j(\mathbf{r}) - \hat{K}_j(\mathbf{r})\right] \qquad (2)$$

The Coulomb, Ĵ_j(r), exchange, K̂_j(r), and one-electron, ĥ(r), operators are given by

$$\hat{h}(\mathbf{r},\mathbf{R}) = -\tfrac{1}{2}\nabla^2 + v(\mathbf{r},\mathbf{R}) \qquad (3)$$

$$\hat{J}_j(\mathbf{r}) = \int \frac{\psi_j^*(\mathbf{r}')\,\psi_j(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}' \qquad (4)$$

$$\hat{K}_j(\mathbf{r})\,\psi_i(\mathbf{r}) = \left[\int \frac{\psi_j^*(\mathbf{r}')\,\psi_i(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}'\right]\psi_j(\mathbf{r}) \qquad (5)$$

where v(r,R) is the nuclear potential. The associated Hartree-Fock equations,

$$\hat{f}\,\psi_i(\mathbf{r}) = \epsilon_i\,\psi_i(\mathbf{r}) \qquad (6)$$

permit a complete set of eigenfunctions. In the simplest case where N is even, each of the N/2 orbitals corresponding to the lowest eigenvalues is associated with two electrons of opposite spin. These occupied orbitals are used to construct the many-electron wavefunction; they are labelled by the subscripts i, j, .... The remaining functions form the virtual orbitals and are indicated by the subscripts a, b, ....

Although in some implementations the Hartree-Fock equations are solved numerically, the traditional approach is to expand the molecular orbitals (MOs) in a basis of atomic orbitals, {χ_μ},

$$\psi_p(\mathbf{r}) = \sum_{\mu}^{N_b} c_{\mu p}\,\chi_\mu(\mathbf{r}) \qquad (7)$$

The optimized coefficients, c_{μp}, and orbital energies, ε_p, are determined from the secular equations, which in matrix notation can be written as

$$\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\epsilon} \qquad (8)$$

where S is the atomic orbital (AO) overlap matrix, S_{μν} = ⟨χ_μ|χ_ν⟩, F is the AO Fock matrix, F_{μν} = ⟨χ_μ|f̂|χ_ν⟩, C is the matrix of MO coefficients, and ε is the diagonal matrix of MO energies. These are the celebrated Roothaan-Hall self-consistent field equations.
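Eqn. (8) is a generalized symmetric eigenvalue problem, which standard linear algebra libraries solve directly. As a minimal illustration (not the algorithm used in any production code), the following Python sketch diagonalizes a toy Fock matrix against a toy overlap matrix; in a real SCF procedure F depends on C, so this solve is repeated to self-consistency:

    import numpy as np
    from scipy.linalg import eigh

    Nb = 4                               # number of AO basis functions (arbitrary)
    rng = np.random.default_rng(0)
    S = 0.1 * rng.random((Nb, Nb))
    S = np.eye(Nb) + 0.5 * (S + S.T)     # toy overlap matrix: symmetric, positive definite
    F = rng.random((Nb, Nb))
    F = 0.5 * (F + F.T)                  # toy Fock matrix: symmetric
    eps, C = eigh(F, S)                  # solves FC = SCe for MO energies and coefficients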

The Hartree-Fock solution (in a complete AO basis) generally recovers more than 99% of the exact energy, which is remarkable. The neglected energy is known as the correlation energy, and its accurate and efficient recovery is the central challenge in quantum chemistry. Indeed, for the purposes of predictive chemistry, we are largely interested in energy differences which are of similar magnitude to the correlation energy: about 0.04 E_h (100 kJ mol⁻¹) for two electrons in a doubly occupied orbital.

Currently the most popular approach for solving eqn. (1) including electron correlation is density functional theory (DFT). DFT, which bypasses explicit construction of the many-body wavefunction and focuses only on the much simpler 3-dimensional electron density as the basic variable, is extremely useful due to its favorable balance between accuracy and efficiency.

Instead, we examine in this article another approach: a widely-used and computationally efficient correlated wavefunction-based method, known as second-order Møller-Plesset perturbation theory (MP2) [5]. MP2 is known to produce equilibrium geometries of comparable accuracy to density functional theory (DFT) and in particular is able to capture long-range correlation effects such as the dispersion interaction. For many weakly bound systems where DFT results are often questionable, MP2 is essentially the least expensive and most reliable alternative.

If HF theory can be thought of as a first-order solution to eqn. (1), then MP2 theory is the second-order solution within the framework of perturbation theory, where the many-electron Hamiltonian is partitioned as

$$\hat{H} = \sum_i \hat{f}(\mathbf{r}_i) + \lambda\hat{H}^{(1)} \qquad (9)$$

We have introduced an order parameter, λ, to expand the energy and wavefunction,

$$E = E^{(0)} + \lambda E^{(1)} + \lambda^2 E^{(2)} + \cdots \qquad (10)$$

$$\Psi = \Psi_{\mathrm{HF}} + \lambda\Psi^{(1)} + \lambda^2\Psi^{(2)} + \cdots \qquad (11)$$

where Ψ_HF is the zero-order (Hartree-Fock) wavefunction, and the zero- and first-order energies are given by

$$E^{(0)} = \sum_i \epsilon_i \qquad (12)$$

$$E^{(0)} + E^{(1)} = \langle\Psi_{\mathrm{HF}}|\hat{H}|\Psi_{\mathrm{HF}}\rangle \qquad (13)$$

The second-order (MP2) correlation energy takes the form

$$E^{(2)} = \sum_{ijab} \frac{(ia|jb)^2 + \tfrac{1}{2}\left[(ia|jb) - (ib|ja)\right]^2}{\epsilon_i + \epsilon_j - \epsilon_a - \epsilon_b} \qquad (14)$$

    where the MO integrals,

$$(ia|jb) = \sum_{\mu\nu\lambda\sigma} C_{\mu i}\,C_{\nu a}\,C_{\lambda j}\,C_{\sigma b}\,(\mu\nu|\lambda\sigma) \qquad (15)$$

    are obtained by transforming the AO integrals,

$$(\mu\nu|\lambda\sigma) = \int\!\!\int \frac{\chi_\mu(\mathbf{r}_1)\chi_\nu(\mathbf{r}_1)\,\chi_\lambda(\mathbf{r}_2)\chi_\sigma(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\, d\mathbf{r}_1\, d\mathbf{r}_2 \qquad (16)$$

One way to considerably reduce the computational cost of MP2 calculations is to exploit the linear dependence in the product space of atomic orbitals. This allows us to expand products of AOs as linear combinations of atom-centered auxiliary basis functions, χ_P,

$$\rho_{\mu\nu}(\mathbf{r}) = \chi_\mu(\mathbf{r})\chi_\nu(\mathbf{r}) \approx \tilde{\rho}_{\mu\nu}(\mathbf{r}) = \sum_P C_{\mu\nu,P}\,\chi_P(\mathbf{r}) \qquad (17)$$

and therefore to approximate eqn. (16) in terms of only 2- and 3-index quantities as

$$(\mu\nu|\lambda\sigma) \approx \sum_{P,Q} (\mu\nu|P)\,(P|Q)^{-1}\,(Q|\lambda\sigma) \qquad (18)$$

The idea is equivalent to an approximate insertion of the resolution-of-the-identity (RI),

$$\hat{I} = \sum_m |m)(m| \approx \sum_{P,Q} |P)(P|Q)^{-1}(Q| \qquad (19)$$

    from which the name RI-MP2 is derived.

Our work is implemented in a development version of Q-Chem 3.1 [9], where the RI-MP2 correlation energy is evaluated in five steps, as described elsewhere [3]. Previously we showed that steps 3 and 4, the formation of the approximate MO integrals, were by far the most expensive operations for medium to large-sized systems, and require the matrix multiplications

$$(ia|jb) \approx \sum_Q B_{ia,Q}\,B_{jb,Q} \qquad (20)$$

and

$$B_{ia,Q} = \sum_P (ia|P)\,(P|Q)^{-1/2} \qquad (21)$$

These are the two operations we will concentrate on accelerating in this work.
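Viewed as arrays with a compound row index ia, eqns. (20) and (21) are just two large matrix multiplications. The NumPy sketch below, with random arrays standing in for the real three-center integrals and all dimensions chosen arbitrarily, shows the shapes involved:

    import numpy as np

    n_occ, n_virt, n_aux = 10, 40, 120         # illustrative dimensions only
    rng = np.random.default_rng(0)
    iaP = rng.random((n_occ * n_virt, n_aux))  # stand-in for (ia|P)
    M = rng.random((n_aux, n_aux))
    PQ = M @ M.T + n_aux * np.eye(n_aux)       # stand-in for (P|Q), positive definite

    w, V = np.linalg.eigh(PQ)                  # (P|Q)^(-1/2) via eigendecomposition
    PQ_inv_sqrt = V @ np.diag(w**-0.5) @ V.T

    B = iaP @ PQ_inv_sqrt                      # eqn (21): one large GEMM
    iajb = B @ B.T                             # eqn (20): a second large GEMM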

    III. GPU ACCELERATION OF GEMM

In this section we first describe our cleaving algorithm to allow matrix multiplications of arbitrary size to be accelerated on the GPU (assuming sufficient CPU memory). Next, we propose two different algorithms for the MGEMM (mixed-precision general matrix multiply) library, using two different schemes to partition the matrices into simpler components.

    A. Cleaving GEMMs

    Consider the matrix multiplication

$$\mathbf{C} = \mathbf{A}\mathbf{B} \qquad (22)$$

where A is an (m × k) matrix, B is a (k × n) matrix, and C an (m × n) matrix. We can divide A into a column vector of r + 1 matrices

$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_0 \\ \mathbf{A}_1 \\ \vdots \\ \mathbf{A}_r \end{pmatrix} \qquad (23)$$

where each entry A_i is a (p_i × k) matrix, and $\sum_i^r p_i = m$. In practice, all the p_i will be the same, with the possible exception of p_r, which will be an edge case. In a similar manner, we can divide B into a row vector of s + 1 matrices

$$\mathbf{B} = \begin{pmatrix} \mathbf{B}_0 & \mathbf{B}_1 & \cdots & \mathbf{B}_s \end{pmatrix} \qquad (24)$$

where each B_j is a (k × q_j) matrix, and $\sum_j^s q_j = n$. Again, all the q_j will be the same, with the possible exception of q_s. We then form the outer product of these two vectors

$$\mathbf{C} = \begin{pmatrix} \mathbf{A}_0 \\ \mathbf{A}_1 \\ \vdots \\ \mathbf{A}_r \end{pmatrix} \begin{pmatrix} \mathbf{B}_0 & \mathbf{B}_1 & \cdots & \mathbf{B}_s \end{pmatrix} \qquad (25)$$

$$= \begin{pmatrix} \mathbf{A}_0\mathbf{B}_0 & \mathbf{A}_0\mathbf{B}_1 & \cdots & \mathbf{A}_0\mathbf{B}_s \\ \mathbf{A}_1\mathbf{B}_0 & \mathbf{A}_1\mathbf{B}_1 & \cdots & \mathbf{A}_1\mathbf{B}_s \\ \vdots & & \ddots & \vdots \\ \mathbf{A}_r\mathbf{B}_0 & \cdots & \cdots & \mathbf{A}_r\mathbf{B}_s \end{pmatrix} \qquad (26)$$

Each individual C_{ij} = A_iB_j is a (p_i × q_j) matrix, and can be computed independently of all the others. Generalizing this to a full *GEMM implementation, which includes the possibility of transposes being taken, is tedious but straightforward.

We have implemented this approach for the GPU as a complete replacement for *GEMM. The p_i and q_j values are chosen such that each sub-multiplication fits within the currently available GPU memory. Each multiplication is staged through the GPU, and the results assembled on the CPU. This process is hidden from the user code, which simply sees a standard *GEMM call. In this way, we are able to multiply matrices of arbitrary size using MGEMM regardless of the available GPU memory.
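A minimal NumPy mock-up of the cleaving scheme follows. It is not the library's actual code: an in-memory product of each panel pair stands in for the staged CUBLAS SGEMM call, and the panel size is an arbitrary parameter:

    import numpy as np

    def cleaved_gemm(A, B, max_panel=2048):
        m, k = A.shape
        _, n = B.shape
        C = np.empty((m, n), dtype=A.dtype)
        for i0 in range(0, m, max_panel):      # row panels A_i of eqn. (23)
            i1 = min(i0 + max_panel, m)        # min() handles the p_r edge case
            for j0 in range(0, n, max_panel):  # column panels B_j of eqn. (24)
                j1 = min(j0 + max_panel, n)
                # on the GPU, this line would be one staged CUBLAS SGEMM call
                C[i0:i1, j0:j1] = A[i0:i1, :] @ B[:, j0:j1]
        return C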

    B. Bitwise MGEMM algorithm

Consider the partitioning of a double-precision (DP) floating point number, A = m × 2^k,

$$A \approx A_u + A_l \qquad (27)$$

where A_u and A_l are single-precision (SP) numbers storing the uppermost n_u, and the next lowest n_l, significant bits of m, respectively. Consider next the multiplication of two scalars, A and B. Applying the bitwise partitioning, we can approximate the full DP multiplication as four SP multiplications,

$$AB \approx A_uB_u + A_uB_l + A_lB_u + A_lB_l \qquad (28)$$

where the result of each SP multiplication is accumulated in DP. For eqn. (28) to be exact, n_u and n_l must be sufficiently small that no round-off error occurs when storing each of the four multiplications in SP.

Anticipating the performance of eqn. (28), we can introduce an alternative approximation, involving only three SP multiplications,

$$AB \approx A_uB_u + A_uB_l + A_l(B_u + B_l) \qquad (29)$$

    where Bu+Bl is replaced by the SP cast ofB. In general,we can expect the round-off error associated with AlBu

    to be of the same order of magnitude, on average, as theround-off error associated with Al(Bu + Bl).

Finally, we can generalize eqn. (29) for the matrix multiplication,

$$\mathbf{A}\mathbf{B} \approx \mathbf{A}_u\mathbf{B}_u + \mathbf{A}_u\mathbf{B}_l + \mathbf{A}_l(\mathbf{B}_u + \mathbf{B}_l) \qquad (30)$$

where, for each element of X ∈ {A, B},

$$X_{ij} \approx X^u_{ij} + X^l_{ij} \qquad (31)$$

As above, the final term may be considered as either two separate multiplications, or as a single multiplication where B_u + B_l is pre-summed. All the multiplications may be evaluated efficiently using the CUBLAS SGEMM library routines on the GPU. The results may then be accumulated in DP on the CPU to yield the final approximation for AB.

For eqn. (30) to be exact, in addition to the issues for scalar multiplication, we have the additional round-off errors arising from the multiply-add operations. Specifically, if A is an M × K matrix and B is a K × N matrix, AB effectively consists of MN dot products of length K. As K increases, or as the range of magnitudes of X_{ij} becomes wider, more round-off error will occur due to the accumulation in SP. We explore these issues more in the benchmarks below.
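The structure of eqns. (29)-(31), three SP products accumulated in DP, can be mimicked on the CPU with NumPy. In the toy model below the upper part is simply the float32 cast of each element and the lower part the float32 residual, rather than the explicit n_u/n_l bit truncation used in our library:

    import numpy as np

    def bitwise_mgemm(A, B):
        Au = A.astype(np.float32)
        Al = (A - Au.astype(np.float64)).astype(np.float32)  # lower part: SP residual
        Bu = B.astype(np.float32)
        Bl = (B - Bu.astype(np.float64)).astype(np.float32)
        Bsp = B.astype(np.float32)        # SP cast of B, i.e. Bu + Bl pre-summed
        # three SP products (each would be a CUBLAS SGEMM), accumulated in DP
        return ((Au @ Bu).astype(np.float64)
                + (Au @ Bl).astype(np.float64)
                + (Al @ Bsp).astype(np.float64))

    rng = np.random.default_rng(0)
    A, B = rng.random((500, 500)), rng.random((500, 500))
    err_sp = np.abs(A.astype(np.float32) @ B.astype(np.float32) - A @ B).max()
    err_mx = np.abs(bitwise_mgemm(A, B) - A @ B).max()   # markedly smaller than err_sp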

    C. Heterogeneous MGEMM algorithm

A different way to improve the precision is to consider the magnitudes of the matrix elements from the outset. That is, we decompose the matrix multiplication, C = AB, by splitting A and B into 'large' and 'small' components, giving

$$\mathbf{C} = \left(\mathbf{A}_{\mathrm{large}} + \mathbf{A}_{\mathrm{small}}\right)\left(\mathbf{B}_{\mathrm{large}} + \mathbf{B}_{\mathrm{small}}\right) = \mathbf{A}\mathbf{B}_{\mathrm{large}} + \mathbf{A}_{\mathrm{large}}\mathbf{B}_{\mathrm{small}} + \mathbf{A}_{\mathrm{small}}\mathbf{B}_{\mathrm{small}} \qquad (32)$$

where we have taken the simple approach of introducing a cutoff value, δ, to define the split: if |X_{ij}| > δ, the element is considered 'large'; otherwise it is considered 'small'.

The A_small B_small term consists entirely of small numbers, and can be run with reasonable accuracy in single precision on the GPU (using the cleaving approach described above, if needed). The other two terms contain large numbers, and need to be run in double precision to achieve greater accuracy. However, since each of the 'large' matrices will often be sparse, these terms each consist of a dense-sparse multiplication.

We only store the nonzero terms of the A_large and B_large matrices, cutting the computational complexity significantly. Consider

$$C_{ik} = A_{ij}\,B^{\mathrm{large}}_{jk} \qquad (33)$$

Only a few B^large_jk will be nonzero, and we consider each in turn. For a particular scalar B^large_jk, only the kth column of C will be nonzero, and it is equal to the product of B^large_jk and the jth column vector of A. This nonzero column vector C_ik can be added to the final result, C, and the next B^large_jk considered. A similar process can be applied to the A_large B_small term (producing row vectors of C). Again, this approach can be generalized to a full *GEMM implementation including transposes. Figure 1 shows a cartoon of the heterogeneous MGEMM algorithm described above.

FIG. 1: Pictorial representation of the heterogeneous MGEMM algorithm. The bottom-left and top-right panels show the A and B matrices and their separation into matrices with large and small elements, given a cutoff parameter δ. The bottom-right panel shows the components of the final AB matrix, where the green component involves a dense matrix multiplication computed with CUBLAS SGEMM, while the blue components involve sparse matrix multiplications computed with a daxpy-like algorithm (explained in the text). The blocks follow the nomenclature shown in eqn. (32).
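A compact NumPy sketch of eqn. (32) is given below. The cutoff δ and the matrix contents are arbitrary, and the two 'large' terms are computed densely for readability, whereas the library stores only their nonzero elements and loops over them daxpy-style:

    import numpy as np

    def hetero_mgemm(A, B, delta=1.0):
        A_large = np.where(np.abs(A) > delta, A, 0.0)  # elements above the cutoff
        B_large = np.where(np.abs(B) > delta, B, 0.0)
        A_small, B_small = A - A_large, B - B_large
        # GPU part: dense small-small product in single precision
        C_ss = (A_small.astype(np.float32)
                @ B_small.astype(np.float32)).astype(np.float64)
        # CPU part: the two 'large' terms of eqn. (32) in double precision
        return C_ss + A @ B_large + A_large @ B_small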

    IV. MGEMM BENCHMARKS

We now explore the accuracy and efficiency of the two MGEMM algorithms for various matrix structures. Clearly, the aim is to achieve greater accuracy than a simple SGEMM call on the GPU, but with minimal extra computational cost. Throughout this section, calculations were made using an Intel Xeon E5472 (Harpertown) CPU clocked at 3.0 GHz attached to an NVIDIA Tesla C1060.

    A. Bitwise MGEMM benchmarks

First we examine the MGEMM algorithm described in Sec. III B. In the following, we chose to benchmark only the implementation using three multiplications, as in eqn. (29). Numerous test calculations showed that using four explicit multiplications gave no significant improvement in accuracy over the scheme with three multiplications. Since the latter is faster, it should obviously be favoured.

FIG. 2: MGEMM enhancement versus the number of upper bits, n_u (with n_l = 23), for N × N matrices where N ∈ {1, 10, 100, 10³, 10⁴}. The matrices were initialized with uniform random numbers on the range [1, 2].

To quantify the accuracy, we found it useful to introduce a new metric which we call the enhancement, η. Using DGEMM as the reference, if ε_s and ε_m are the root mean square (RMS) errors in a matrix element for SGEMM and MGEMM, respectively, then we define η = ε_s/ε_m. Note that when multiplying two N × N matrices, the RMS errors are evaluated from N² data points. For the small matrices, therefore, the number of data points was insufficient to obtain reliable statistics. To remedy this, we repeated the smaller matrix multiplications (with different random numbers) to get more data points. The number of repeats was chosen to ensure that the standard error of the mean was at least two orders of magnitude smaller than the RMS error.
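Measuring the enhancement is straightforward once a DP reference product is available; a short sketch, where `mgemm` is any mixed-precision routine (for instance the `bitwise_mgemm` toy above):

    import numpy as np

    def enhancement(A, B, mgemm):
        ref = A @ B                                         # DP (DGEMM) reference
        sp = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)
        eps_s = np.sqrt(np.mean((sp - ref) ** 2))           # SGEMM RMS error
        eps_m = np.sqrt(np.mean((mgemm(A, B) - ref) ** 2))  # MGEMM RMS error
        return eps_s / eps_m                                # eta = eps_s / eps_m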

Figure 2 shows the enhancement for the multiplication of random N × N matrices with varying number of upper bits, n_u (and n_l = 23). Five curves are shown for N ∈ {1, 10, 100, 10³, 10⁴}. The matrices were initialized with numbers uniform in the range [1, 2].

The enhancement is dominated by round-off errors in the A_uB_u terms. On average, these errors are of the same order of magnitude as the SGEMM errors, thus MGEMM will only show an improvement when these errors vanish. We can achieve this by first remembering that a single-precision float can only store 24 significant bits in the mantissa. Therefore, if we only multiply floats with fewer than 12 significant bits, there will be no round-off error associated with the operation.

For N = 1, the effect is clear in Figure 2: the enhancement suddenly decreases for n_u > 12. For N > 1, the number of bits, n_u, must be sufficiently small to also prevent round-off errors when accumulating results from many multiply-add operations. As N becomes larger, this effect is exacerbated, so the optimal n_u decreases.

FIG. 3: MGEMM peak enhancement versus matrix size using uniform random numbers on 5 ranges: [1, 2], [1, 100], [1, 10³], [1, 10⁴], and [1, 10⁵]. Optimal values of n_u upper bits (with n_l = 23) were used for all data points.

Decreasing n_u, however, introduces larger errors into the other terms, such as A_uB_l. These errors are smaller than the A_uB_u errors, but increase exponentially as n_u decreases. Decreasing n_u is therefore favourable until the A_uB_u errors vanish, but there is no advantage to decreasing n_u further. In general, the combination of these effects means that the peak enhancement decreases with N.

Figure 3 explores the peak enhancement in more detail. Each curve uses the optimal n_u values and initializes A and B with uniform random numbers on one of 5 intervals: [1, W], where W ∈ {2, 100, 10³, 10⁴, 10⁵}. For W = 2, the enhancement decreases from O(10³) to O(10¹) as N increases, for reasons discussed above. For errors corresponding to a classical random walk, we'd expect η to decay as N^(−1/2), which implies a gradient of −1/2 in the log-log plot; this is approximately true for W = 2. For W > 2, however, the gradient in the log-log plot is reduced, and the enhancements are almost 2 orders of magnitude smaller for all N > 1. In summary, bitwise MGEMM is clearly most effective in a few special cases, such as for very small matrices, or when all the matrix elements are of the same order of magnitude.

Table I highlights the issues regarding speedup when running bitwise MGEMM on the GPU compared to DGEMM on the CPU for different N. In addition, we catalogue the optimal values of n_u as a function of N and W (the range of random numbers [1, W]). As expected, the speedup increases as N increases, but unfortunately, bitwise MGEMM is only faster for the very largest matrices. The overheads of using the GPU, such as data transfer costs and memory access latencies, are well known; for N < 1000 the acceleration can be marginal (see also Figure 4). In addition, in our simple implementation of eqn. (29), there is also a significant overhead from splitting and processing the A and B matrices. Eqn. (30) implies three matrix multiplications, which are independent of each other; therefore, it has not escaped our attention that this scheme is parallelizable. In summary, we see up to 3 times speedup over CPU DGEMM for large N, but a slowdown for N ≲ 1000.

                          Optimal n_u
    N      Speedup   W = 2   W = 100   W = 10³   W = 10⁴   W = 10⁵
    1      0.01      12      12        12        12        12
    10     0.01      10      6         6         6         6
    100    0.07      9       5         4         4         4
    1000   0.52      7       4         4         4         4
    10000  2.87      5       3         3         3         3

TABLE I: Speedups relative to CPU DGEMM and optimal number of upper bits, n_u, when using bitwise MGEMM for various matrix sizes with random elements on five intervals [1, W].

FIG. 4: Speedup for various *GEMM calls as a function of matrix size. Most elements were in the range [−1, 1], with the salt values in the range [90, 110]. Times are scaled relative to running DGEMM on the CPU.

    B. Heterogeneous MGEMM benchmarks

We now benchmark the second MGEMM algorithm, described in Sec. III C. Figure 4 shows the speedup for a variety of *GEMM calls on the GPU with respect to the size, N, of an N × N matrix, relative to the time taken for the corresponding DGEMM call on the CPU. The input matrices were initialized with uniform random values in the range [−1, 1] and then salted with a fraction f_salt of random larger values in the range [90, 110]. Both SGEMM and DGEMM were timed using the GPU (the latter being possible for modern cards, and included for reference). Three MGEMM runs were tested, for f_salt = 10⁻², 10⁻³ and 10⁻⁴. The size of the cutoff parameter δ was chosen such that all the salted elements were considered 'large'. All timings were averaged over ten runs.

Running CUBLAS SGEMM is approximately 17.1 times faster than running DGEMM on the CPU for a matrix of size 10048 × 10048, and is even faster for larger matrices. This represents an upper bound for the speedups we can hope to obtain with MGEMM for such matrices. Leveraging the GPU for small matrices is not effective, due to well-known overheads such as memory transfer and access latencies.

FIG. 5: RMS error in a single element for various *GEMM calls as a function of matrix size compared to CPU DGEMM. Background elements were in the range [−1, 1], with the salt values in the range [90, 110] or [9990, 10010].

In contrast, the MGEMM speedups are strongly dependent on the fraction f_salt, which determines how much of the calculation is done in double precision on the CPU. For f_salt = 10⁻⁴, the speedups are approximately 10x, but for f_salt = 10⁻³, speedups of approximately 2x relative to CPU DGEMM are observed. Indeed, for f_salt = 10⁻², MGEMM is actually slower than CPU DGEMM.

Next we consider the accuracy enhancement when using MGEMM compared to SGEMM. In Figure 5, we show the root mean square (RMS) errors of each matrix element relative to CPU DGEMM for different matrix sizes. As above, all the matrices were initialized with uniform random values in the range [−1, 1], but now the salting sizes were grouped into two ranges: [90, 110] and [9990, 10010]. Results are shown for SGEMM and MGEMM for various salting fractions.

As expected, SGEMM produces substantial errors. With a fraction of salted values, f_salt = 1%, in the range [90, 110], the errors are of O(0.01) for the medium-sized matrices. In contrast, the errors are more than 2 orders of magnitude smaller when using MGEMM, and are the same regardless of the fraction or size of the salted elements. The limiting MGEMM errors are the same as the SGEMM errors for a pair of unsalted random matrices on [−1, 1], because the MGEMM algorithm guarantees that all the salted contributions are computed on the CPU. Indeed, if the salts were larger or more numerous, the SGEMM errors would be even larger, but the MGEMM errors would be unchanged. This is in stark contrast to the behaviour of the bitwise MGEMM algorithm.

    C. Comparison of MGEMM schemes

The behaviour of the two MGEMM algorithms is very different. First, the speed of the bitwise MGEMM depends only on the size N, not on the elements of A or B. Moreover, it is only 3 times faster than CPU DGEMM in the best case. In contrast, the speed of the heterogeneous MGEMM algorithm strongly depends on the fraction of large elements. While it is up to 10 times faster than CPU DGEMM for a 0.01% fraction of large elements, for a 1.0% fraction it is actually slower.

Concerning the accuracy of the two algorithms, consider the following cases. First, suppose the elements of A and B are random numbers on the interval [1, 2]. Mean errors produced by bitwise MGEMM are approximately 2 to 3 orders of magnitude smaller than SGEMM, depending on the matrix size (cf. Figure 2). In contrast, it is obvious that no value of δ can be chosen to make the heterogeneous MGEMM useful. At the other extreme, consider two matrices with a vast majority of elements in the range [0, 1], and a small minority of scattered elements of unlimited size. Now the heterogeneous MGEMM will be extremely useful, while the bitwise algorithm is likely to be quite ineffective due to the problems previously explained.

For this research, we are interested in accelerating quantum chemistry. It turns out that the relevant A and B for the RI-MP2 method have large N and very large W, suggesting that the heterogeneous algorithm should be the method of choice.

    V. RI-MP2 ACCELERATION BENCHMARKS

We now show how the tools described above can accelerate full RI-MP2 quantum chemistry calculations on real molecules. To achieve this, we accelerate the evaluation of eqn. (20) and eqn. (21) using MGEMM running on a GPU. We compare to the use of standard DGEMM BLAS on the CPU, and CUBLAS SGEMM on the GPU. For all these benchmarks, we used an AMD Athlon 5600+ CPU clocked at 2.8 GHz, combined with an NVIDIA Tesla C1060 GPU with 4 GiB of RAM. The matrix cleaver and MGEMM were implemented in a modified version of the Q-Chem 3.1 RI-MP2 code described elsewhere [3].

As anticipated in the previous section, the bitwise MGEMM library does not give useful improvements compared to a standard CPU DGEMM. In the evaluation of eqn. (20), we observed no significant improvement in accuracy using bitwise MGEMM, and an enhancement of only 2.5 for eqn. (21). Thus we decided not to study the use of bitwise MGEMM further here. The results of applying the heterogeneous MGEMM algorithm (just MGEMM, from now on) were much more encouraging.

For our test systems we chose a set of linear alkanes (C8H18, C16H34, C24H50, C32H66, C40H82), as well as two molecules of pharmaceutical interest, the anti-cancer drug taxol (C47H51NO14) and the antibiotic valinomycin (C54H90N6O18). We used the cc-pVDZ and cc-pVTZ [10] atomic orbital basis sets throughout.

                   Speedup           SGEMM energy error
    Molecule       SGEMM    DGEMM    (kcal mol⁻¹)
    C8H18          2.1      1.9      -0.05616
    C16H34         4.5      3.7      -0.12113
    C24H50         6.9      5.2      -0.62661
    C32H66         9.0      6.4      -0.75981
    C40H82         11.1     7.2      -1.12150
    Taxol          11.3     7.1      -6.26276
    Valinomycin    13.8     7.8      -9.99340

TABLE II: Speedups using CUBLAS SGEMM and DGEMM, and total energy errors relative to CPU DGEMM, for various molecules in a cc-pVDZ basis.

First, in Table II we benchmark the reference case of using either CUBLAS SGEMM or DGEMM for each test molecule using the double-ζ basis set. The table shows the speedup in computing the RI-MP2 correlation energy and the error relative to a standard CPU calculation (the DGEMM errors are negligible). The speedups and SGEMM errors are greater for the larger molecules, with the largest speedups observed for valinomycin: 13.8x and 7.8x, using SGEMM and DGEMM, respectively. However, while CUBLAS DGEMM gives essentially no loss of accuracy, the SGEMM error is approximately -10.0 kcal mol⁻¹, which is well beyond what is generally accepted as chemical accuracy.

Quantum chemistry generally aims to achieve a target accuracy of 1.0 kcal mol⁻¹. In Table III, we explore the performance of MGEMM using a constant cutoff value of δ = 1.0 to try and reduce the SGEMM errors in Table II. The results show speedups and total energy errors for each molecule in both the double-ζ and triple-ζ basis sets. In this particular case, we have limited the GPU to use only 256 MiB of RAM to mimic the capability of older cards and emphasize the use of the MGEMM cleaver. This will naturally result in a loss of speedup compared to utilizing a larger GPU memory. In the case of taxol the reduction is approximately 20%.

                   Speedup               Energy error (kcal mol⁻¹)
    Molecule       Double-ζ   Triple-ζ   Double-ζ    Triple-ζ
    C8H18          1.9        2.7        -0.01249    -0.03488
    C16H34         3.8        5.6        -0.00704    -0.04209
    C24H50         5.8        8.2        -0.14011    -0.33553
    C32H66         7.9        9.2        -0.08111    -0.29447
    C40H82         9.4        10.0       -0.13713    -0.51186
    Taxol          9.3        10.0       -0.50110    -1.80076
    Valinomycin    10.1       -          -1.16363    -

TABLE III: MGEMM speedups and total energy errors with respect to CPU DGEMM for various molecules in cc-pVDZ and cc-pVTZ bases.

Looking at Table III, the trends are the same as in Table II, but the MGEMM errors are approximately an order of magnitude less than the SGEMM errors (for the larger molecules). For valinomycin in the cc-pVDZ basis, the SGEMM speedup is reduced from 13.8x to 10.1x using MGEMM, but the error in the total energy is also reduced from -10.0 kcal mol⁻¹ to -1.2 kcal mol⁻¹, which is now very close to chemical accuracy. Moreover, while CUBLAS DGEMM clearly has the advantage (when available) of high accuracy, if -1.2 kcal mol⁻¹ is deemed an acceptable accuracy, MGEMM could be favoured since the DGEMM speedup is only 7.8x compared to 10.1x. Note that the errors are larger for the triple-ζ basis simply because it requires more computation, which leads to greater error accumulation; this is a general issue not only for higher quality basis sets, but also for larger molecules using lower quality basis sets.

    VI. CONCLUSION

We have demonstrated how computational quantum chemistry can benefit from leveraging the power of graphical processing units in a very simple way. Our new tools are easy to incorporate into existing legacy codes where matrix multiplications involve a substantial fraction of the overall computational cost. Clearly, more efficient use of the GPU can be achieved in special cases by devoting time to rewriting and redesigning the algorithms for GPU architectures. However, this is often not a feasible option in practice, especially for a mature discipline such as quantum chemistry, where we typically employ large program suites representing years or decades of coding effort.

In summary, our contribution has been the development, testing and benchmarking of a general black-box approach for the efficient GPU acceleration of matrix-matrix multiplications. In particular, our new library, which we call MGEMM [11], works for matrices of arbitrary size, even if too large for the whole computation to be held in the GPU's onboard memory. Moreover, while assuming only single-precision operations to be available on the GPU, we have demonstrated two algorithms whereby orders of magnitude greater accuracy can be achieved within the context of a mixed-precision approach.

To illustrate the utility of MGEMM for quantum chemistry, we combined it with the Q-Chem 3.1 [9] program package and computed the MP2 correlation energy of the 168-atom valinomycin molecule in a cc-pVDZ basis set. Using SGEMM, MGEMM and (GPU) DGEMM, we observed speedups of 13.8x, 10.1x and 7.8x, respectively, while MGEMM was an order of magnitude more accurate than SGEMM, thus achieving chemical accuracy.

Traditionally, GPUs have only had single-precision (SP) support, and it remains true that the vast majority of GPU cards worldwide currently do not have double-precision (DP) capability. Indeed, we are interested in using commodity GPUs within a grid-computing environment, such as that promoted by the Berkeley Open Infrastructure for Network Computing (BOINC) [12]. GPUs in a typical BOINC client do not have DP support, yet comprise a formidable resource worldwide.

Nevertheless, DP devices and co-processors are now available. Moreover, the trend seems to be towards the release of more advanced DP support in the future, such as the next-generation GPU from NVIDIA, currently code-named Fermi. Fermi will reportedly have a DP peak performance which is only a factor of 2 less than the SP performance. (For NVIDIA's C1060, the ratio is approximately 12.)

It is therefore valid to question the potential of mixed-precision algorithms for next-generation GPU architectures. We believe this is an open issue. For example, practical calculations on GPUs are very often bound by memory bandwidth, rather than raw operation count. While DP cards are getting faster, this I/O bottleneck is likely to remain. In these cases, the transfer and processing of only SP data could effectively double the performance compared to naive DP calculations. Overall, we believe that mixed-precision algorithms could remain important for applications where the highest performance is a priority.

    VII. ACKNOWLEDGMENTS

The authors would like to thank L. Vogt for helpful discussions and her contributions to the numerical benchmarks. Financial support was provided by the NSF Cyber-Enabled Discoveries and Innovations (CDI) Initiative Award PHY-0835713, as well as NVIDIA. R.O.A. wishes to thank CONACyT and Fundación Harvard en México for additional financial support. Technical support by the High Performance Technical Computing Center at the Faculty of Arts and Sciences of Harvard University and the National Nanotechnology Infrastructure Network Computation project was appreciated.

[1] K. Yasuda, J. Comput. Chem. 29, 334 (2008), http://dx.doi.org/10.1002/jcc.20779.
[2] I. S. Ufimtsev and T. J. Martinez, J. Chem. Theory Comput. 4, 222 (2008), http://dx.doi.org/10.1021/ct700268q.
[3] L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik, J. Phys. Chem. A 112, 2049 (2008), http://dx.doi.org/10.1021/jp0776762.
[4] R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. Shao, and A. Aspuru-Guzik, J. Chem. Theory Comput. (2009), http://pubs.acs.org/doi/abs/10.1021/ct900543q.
[5] T. Helgaker, P. Jørgensen, and J. Olsen, Molecular Electronic-Structure Theory (Wiley, Chichester, 2000).
[6] M. Feyereisen, G. Fitzgerald, and A. Komornicki, Chem. Phys. Lett. 208, 359 (1993).
[7] X. Li, J. Demmel, D. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, S. Kang, A. Kapur, M. Martin, et al., ACM Trans. Math. Softw. 28, 152 (2002).
[8] Y. Hida, X. S. Li, and D. H. Bailey, in ARITH '01: Proceedings of the 15th IEEE Symposium on Computer Arithmetic (IEEE Computer Society, Washington, DC, USA, 2001), p. 155.
[9] Y. Shao, L. Fusti-Molnar, Y. Jung, J. Kussmann, C. Ochsenfeld, S. T. Brown, A. T. B. Gilbert, L. V. Slipchenko, S. V. Levchenko, D. P. O'Neill, et al., Phys. Chem. Chem. Phys. 8, 3172 (2006).
[10] T. H. Dunning, Jr., J. Chem. Phys. 90, 1007 (1989).
[11] SciGPU-GEMM v0.8, http://scigpu.org.
[12] J. Bohannon, Science 308, 310 (2005).