
Novel Architectures

Graphical Processing Units for Quantum Chemistry

Ivan S. Ufimtsev and Todd J. Martínez
University of Illinois at Urbana-Champaign

The authors provide a brief overview of electronic structure theory and detail their experiences implementing quantum chemistry methods on a graphical processing unit. They also analyze algorithm performance in terms of floating-point operations and memory bandwidth, and assess the adequacy of single-precision accuracy for quantum chemistry applications.

In 1830, Auguste Comte wrote in his work Philosophie Positive: "Every attempt to employ mathematical methods in the study of chemical questions must be considered profoundly irrational and contrary to the spirit of chemistry. If mathematical analysis should ever hold a prominent place in chemistry—an aberration which is happily almost impossible—it would occasion a rapid and widespread degeneration of that science." Fortunately, Comte's assessment was far off the mark and, instead, the opposite has occurred. Detailed simulations based on the principles of quantum mechanics now play a large role in suggesting, guiding, and explaining experiments in chemistry and materials science. In fact, quantum chemistry is a major consumer of CPU cycles at national supercomputer centers, with the field's rise to prominence largely due to early demonstrations of quantum mechanics applied to chemical problems and the tremendous advances in computing power over the past decades: two developments that Comte could not have foreseen.

However, limited computational resources remain a serious obstacle to the application of quantum chemistry in problems of widespread importance, such as the design of more effective drugs to treat diseases or new catalysts for use in applications such as fuel cells or environmental remediation. Thus, researchers have a considerable impetus to relieve this bottleneck in any way possible, both by developing new and more effective algorithms and by exploring new computer architectures. In our own work, we've recently begun exploring the use of graphical processing units (GPUs), and this article presents some of our experiences with GPUs for quantum chemistry.

GPU Architecture

Low precision (generally, 24-bit arithmetic) and limited programmability stymied early attempts to use GPUs for general-purpose scientific computing.1–4 However, the release of the Nvidia G80 series and the compute unified device architecture (CUDA) application programming interface (API) have ushered in a new era in which these difficulties are largely ameliorated. The CUDA API lets developers control the GPU via an extension of the standard C programming language (as opposed to specialized assembler or graphics-oriented APIs, such as OpenGL and DirectX).


The G80 supports 32-bit floating-point arithmetic, which is largely (but not entirely) compliant with the IEEE-754 standard. In some cases, this might not be sufficient precision (many scientific applications use 64-bit or double precision), but Nvidia has already released its next generation of GPUs that supports 64-bit arithmetic in hardware. We performed all the calculations presented in this article on a single Nvidia GeForce 8800 GTX card using the CUDA API (www.nvidia.com/object/cuda_home.html offers a detailed overview of the hardware and API). We sketch some of the important concepts here.

As depicted schematically in Figure 1, the GeForce 8800 GTX consists of 16 independent streaming multiprocessors (SMs), each comprised of eight scalar units operating in SIMD fashion and running at 1.35 GHz. The device can process a large number of concurrent threads; these threads are organized into a one- or two-dimensional (1- or 2D) grid of 1-, 2-, or 3D blocks with up to 512 threads in each block. Threads in the same block are guaranteed to execute on the same SM and have access to fast on-chip shared memory (16 Kbytes per SM) for efficient data exchange. Threads belonging to different blocks can execute on different SMs and must exchange data through the GPU DRAM (768 Mbytes), which has much larger latency than shared memory. There's no efficient way to synchronize thread block execution (that is, no way to control in which order or on which SM the blocks will be processed), so an efficient algorithm should avoid communication between thread blocks as much as possible.

The thread block grid can contain up to 65,535 blocks in each dimension. Every thread block has its own unique serial number (or two numbers, if the grid is 2D); likewise, each thread also has a set of indices identifying it within a thread block. Together, the thread block and thread serial numbers provide enough information to precisely identify a given thread and thereby split computational work in an application. Much of the challenge in developing an efficient algorithm on the GPU involves determining an ideal mapping between computational tasks and the grid/thread block/thread hierarchy. Ideal mappings lead to load balance with interthread communication restricted to within the thread blocks.
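As a concrete (if trivial) illustration of this hierarchy, the following CUDA sketch maps a 2D grid of 1D thread blocks onto a flat array, with each thread claiming one element from its block and thread indices. The kernel, array, and launch configuration are illustrative choices only; they are not taken from our integral code.

```cuda
// Minimal CUDA sketch of the grid/block/thread hierarchy described above.
#include <cstdio>

__global__ void scaleArray(float *data, int n)
{
    // Serial number of this block within a 2D grid of blocks.
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    // Global index of this thread: block serial number times threads per
    // block, plus the thread's index inside its (1D) block.
    int tid = block * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] *= 2.0f;   // each thread owns exactly one array element
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 threadsPerBlock(256);                        // at most 512 threads per block on the G80
    int blocksNeeded = (n + 255) / 256;
    dim3 grid(1024, (blocksNeeded + 1023) / 1024);    // 2D grid of 1D blocks

    scaleArray<<<grid, threadsPerBlock>>>(d_data, n);
    cudaThreadSynchronize();                          // wait for the kernel to finish
    cudaFree(d_data);
    return 0;
}
```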

Because the number of threads running on a GPU is much larger than the total number of processing units (to fully utilize the device's computational capabilities and efficiently hide the DRAM's access latency, the program must spawn at least 10^4 threads), the hardware executes all threads in a time-slicing fashion. The thread scheduler splits the thread blocks into 32-thread warps, which SMs then process in SIMD fashion, with all 32 threads executed by eight scalar processors in four clock cycles. The thread scheduler periodically switches between warps, maximizing overall application performance. An important point here is that once all the threads in a warp have completed, this warp is no longer scheduled for execution, and no load-balancing penalty is incurred.

Another important consideration is the amount of on-chip resources available for an active thread. Active means that the thread has GPU context (registers and so on) attached to it and is included in the thread scheduler’s “to do” list. Once a thread is activated, it won’t be deactivated until all of its instructions are executed. A G80 SM can support up to 768 active threads (24 warps), and the warp-switching overhead is negligible compared to the time required to execute a typical instruction. Such cost-free switching is possible because every thread has its own context, which implies that the whole register space (32 Kbytes per SM) is evenly distributed among active threads (for 768 active threads, every thread has 10 registers available). If the threads need more registers, fewer threads are activated, leading to partial SM occupation. This important parameter determines if the GPU DRAM access latency can be efficiently hidden (a large number of active threads means that al-though some threads are waiting for data, others can execute instructions and vice versa). Thus, any GPU kernel (a primitive routine each GPU thread executes) should consume as few registers as possible to maximize SM occupation.
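This register arithmetic can be made concrete with a small host-side estimate. The sketch below uses the per-SM limits quoted above (32 Kbytes of registers, that is, 8,192 32-bit registers, and at most 768 active threads in whole warps of 32); the real allocator has additional granularity rules, so this is only an approximation, but it reproduces the "active threads per SM" figures reported later in Table 2.

```cuda
#include <cstdio>

// Rough estimate of how many threads a G80 SM can keep active for a kernel
// that needs regsPerThread registers. Limits are those quoted in the text:
// 8,192 32-bit registers (32 Kbytes) and 768 threads (24 warps of 32) per SM.
// Real hardware allocates registers with extra granularity rules, so treat
// this as an approximation only.
int activeThreadsPerSM(int regsPerThread)
{
    const int regsPerSM  = 8192;
    const int maxThreads = 768;
    const int warpSize   = 32;

    int fitByRegisters = regsPerSM / regsPerThread;
    int threads = fitByRegisters < maxThreads ? fitByRegisters : maxThreads;
    return (threads / warpSize) * warpSize;   // only whole warps can be active
}

int main()
{
    // 10 registers -> 768 threads (full occupancy), 20 -> 384, 24 -> 320,
    // 56 -> 128, matching the kernels listed later in Table 2.
    int regs[] = { 10, 20, 24, 56 };
    for (int i = 0; i < 4; ++i)
        printf("%2d registers/thread -> %3d active threads\n",
               regs[i], activeThreadsPerSM(regs[i]));
    return 0;
}
```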

Quantum Chemistry Overview

Two of the most basic questions in chemistry are, "where are the electrons?" and "where are the nuclei?" Electronic structure theory (that is, quantum chemistry) focuses on the first one. Because the electrons are very light, we must apply the laws of quantum mechanics.

Figure 1. Schematic block diagram of the Nvidia GeForce 8800 GTX. It has 16 streaming multiprocessors (SMs), each containing 8 SIMD streaming processors and 16 Kbytes of shared memory.


These laws describe the electrons with an electronic wave function determined by solving the time-independent Schrödinger equation. As usual in quantum mechanics, this wave function's absolute square is interpreted as a probability distribution for electron positions. Once we know the electronic distribution for a fixed nuclear configuration, it's straightforward to calculate the resulting forces on the nuclei. Thus, the answer to the second question follows from the answer to the first, through either a search for the nuclei arrangement that minimizes energy (molecular geometry optimization) or solution of the classical Newtonian equations of motion. (It's also possible, and in some cases necessary, to solve quantum mechanical equations of motion for the nuclei, but we don't consider this further here.) The great utility of quantum chemistry comes from the resulting ability to predict molecular shapes and chemical rearrangements.

Denoting the set of all electronic coordinates as r and all nuclear coordinates as R, we can write the electronic time-independent Schrödinger equation for a molecule as

$$H(\mathbf{r}, \mathbf{R})\, \psi_{\mathrm{elec}}(\mathbf{r}, \mathbf{R}) = E(\mathbf{R})\, \psi_{\mathrm{elec}}(\mathbf{r}, \mathbf{R}), \qquad (1)$$

where

$H(\mathbf{r}, \mathbf{R})$ is the electronic Hamiltonian operator describing the electronic kinetic energy as well as the Coulomb interactions between all electrons and nuclei, $\psi_{\mathrm{elec}}(\mathbf{r}, \mathbf{R})$ is the electronic wave function, and $E(\mathbf{R})$ is the total energy. This total energy depends only on the nuclear coordinates and is often referred to as the molecule's potential energy surface.

For most molecules, exact solution of Equation 1 is impossible, so we must invoke approximations. Researchers have developed numerous such approximate approaches, the simplest of which is the Hartree-Fock (HF) method.5 In HF theory, we write the electronic wave function as a single antisymmetrized product of orbitals, which are functions describing a single electron's probability distribution. Physically, this means that the electrons see each other only in an averaged sense while obeying the appropriate Fermi statistics. We can obtain higher accuracy by including many antisymmetrized orbital products or by using density functional theory (DFT) to describe the detailed electronic correlations present in real molecules.6 We consider only HF theory in this article because it illustrates many key computational points.
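For readers unfamiliar with the construction, the single antisymmetrized product is the familiar Slater determinant; written in its standard form in terms of one-electron orbitals $\phi_i$ (spin labels suppressed for brevity, and not part of the article's own equation numbering), it reads

$$
\psi_{\mathrm{elec}}(\mathbf{r}_1,\ldots,\mathbf{r}_N) = \frac{1}{\sqrt{N!}}
\begin{vmatrix}
\phi_1(\mathbf{r}_1) & \phi_2(\mathbf{r}_1) & \cdots & \phi_N(\mathbf{r}_1)\\
\phi_1(\mathbf{r}_2) & \phi_2(\mathbf{r}_2) & \cdots & \phi_N(\mathbf{r}_2)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_1(\mathbf{r}_N) & \phi_2(\mathbf{r}_N) & \cdots & \phi_N(\mathbf{r}_N)
\end{vmatrix},
$$

so exchanging any two electron coordinates changes the sign of $\psi_{\mathrm{elec}}$, which is exactly the Fermi statistics requirement mentioned above.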

We express each electronic orbital $\phi_i(r)$ in HF theory as a linear combination of $K$ basis functions $\chi_\mu(r)$ specified in advance:

$$\phi_i(r) = \sum_{\mu=1}^{K} C_{i\mu}\, \chi_\mu(r). \qquad (2)$$

Notice that we no longer write the electronic coordinates in boldface to emphasize that these are a single electron's coordinates. The computational task is then to determine the linear coefficients $C_{i\mu}$. Collecting these coefficients into a matrix C and introducing the overlap matrix S with elements

$$S_{\mu\nu} = \int \chi_\mu(r)\, \chi_\nu(r)\, d^3r, \qquad (3)$$

we can write the HF equations as

$$\mathbf{F}(\mathbf{C})\, \mathbf{C} = \mathbf{S}\, \mathbf{C}\, \varepsilon, \qquad (4)$$

where $\varepsilon$ is a diagonal matrix of one-electron orbital energies. The Fock matrix F is a one-electron analog of the Hamiltonian operator in Equation 1, describing the electronic kinetic energy, the electron–nuclear Coulomb attraction, and the averaged electron–electron repulsion. Because F depends on the unknown matrix C, we solve for the unknowns C and $\varepsilon$ with a self-consistent field (SCF) procedure. After guessing C, we construct the Fock matrix and solve the generalized eigenvalue problem in Equation 4, giving $\varepsilon$ and a new set of coefficients C. The process iterates until C remains unchanged within some tolerance; that is, until C and F are self-consistent.

The dominant effort in the formation of F lies in the evaluation of the two-electron repulsion integrals (ERIs) representing the Coulomb (J) and exchange (K) interactions between pairs of electrons. The exchange interaction is a nonclassical Coulomb-like term arising from the electronic wave function's antisymmetry. Specifically, we construct the Fock matrix for a molecule with N electrons as

$$\mathbf{F}(\mathbf{C}) = \mathbf{H}^{\mathrm{core}} + \mathbf{J}(\mathbf{C}) - \tfrac{1}{2}\mathbf{K}(\mathbf{C}), \qquad (5)$$

where $\mathbf{H}^{\mathrm{core}}$ includes the electronic kinetic energy and the electron–nuclear attraction. (The equations given here are specific to closed-shell singlet molecules, with no unpaired electrons.) The Coulomb and exchange matrices are given by

$$J_{\mu\nu} = \sum_{\lambda\sigma} P_{\lambda\sigma}\, (\mu\nu|\lambda\sigma) \qquad (6)$$

$$K_{\mu\nu} = \sum_{\lambda\sigma} P_{\lambda\sigma}\, (\mu\lambda|\sigma\nu), \qquad (7)$$

in terms of the “density matrix” P and the two-electron integrals (μν|λσ),


$$P_{\lambda\sigma} = 2 \sum_{i=1}^{N/2} C_{\lambda i}\, C_{\sigma i} \qquad (8)$$

$$(\mu\nu|\lambda\sigma) = \iint \frac{\chi_\mu(r_1)\, \chi_\nu(r_1)\, \chi_\lambda(r_2)\, \chi_\sigma(r_2)}{|r_1 - r_2|}\, d^3r_1\, d^3r_2. \qquad (9)$$

Usually, we choose the basis functions as Gaussians centered on the nuclei, which leads to analytic expressions for the two-electron integrals. Nevertheless, $K^4$ such integrals must be evaluated, where K grows linearly with the size of the molecule under consideration. In practice, many of these integrals are small and can be neglected, but the number of non-negligible integrals still grows faster than $O(K^2)$, making their evaluation a critical bottleneck in quantum chemistry. We can calculate and store the ERIs for use in constructing the J and K matrices during the iterative SCF procedure (conventional SCF), or we can recalculate them in each iteration (direct SCF). The direct SCF method is often more efficient in practice because it minimizes the I/O associated with reading and writing ERIs.

The form of the basis functions $\chi_\mu(r)$ is arbitrary in principle, as long as the basis set is sufficiently flexible to describe the electronic orbitals. The natural choice for these functions is the Slater-type orbital (STO), which comes from the exact analytic solution of the electronic structure problem for the hydrogen atom:

$$\chi_\mu^{\mathrm{STO}}(r) \propto (x - x_\mu)^{l}\, (y - y_\mu)^{m}\, (z - z_\mu)^{n}\, e^{-\alpha_\mu |r - R_\mu|}, \qquad (10)$$

where $R_\mu$ is the position of the nucleus on which the basis function is centered (with components $x_\mu$, $y_\mu$, and $z_\mu$), and the integers l, m, and n represent the orbital's angular momentum. The orbital's total angular momentum, $l_{\mathrm{total}}$, is given by the sum of these integers and is often referred to as s, p, and d, for $l_{\mathrm{total}} = 0$, 1, and 2, respectively. Unfortunately, it's difficult to evaluate the required two-electron integrals using these basis functions, so we use Gaussian-type orbitals (GTOs) in their stead:

$$\chi_\mu^{\mathrm{GTO}}(r) \propto (x - x_\mu)^{l}\, (y - y_\mu)^{m}\, (z - z_\mu)^{n}\, e^{-\alpha_\mu |r - R_\mu|^2}. \qquad (11)$$

Relatively simple analytic expressions are available for the required integrals when using GTO basis sets, but the functional form is qualitatively different. To mimic the more physically motivated STOs, we typically contract these GTOs as

$$\chi_\mu^{\mathrm{STO}}(r) \approx \chi_\mu^{\mathrm{GTO,\,contracted}}(r) = \sum_{i=1}^{N_\mu} d_{\mu i}\, \chi_{\mu i}^{\mathrm{GTO,\,primitive}}(r). \qquad (12)$$

This procedure leads to two-electron integrals over contracted basis functions, which are given as sums of integrals over primitive basis functions. Unlike the elements of the C matrix, the contraction coefficients $d_{\mu i}$ aren't allowed to vary during the SCF iterative process. The number of primitives in a contracted basis function, $N_\mu$, is the contraction length and usually varies between one and eight. Given this basis set construction, we can now talk about ERIs over contracted or primitive basis functions:

$$(\mu\nu|\lambda\sigma) = \sum_{p=1}^{N_\mu} \sum_{q=1}^{N_\nu} \sum_{r=1}^{N_\lambda} \sum_{s=1}^{N_\sigma} d_{\mu p}\, d_{\nu q}\, d_{\lambda r}\, d_{\sigma s}\, [pq|rs], \qquad (13)$$

where square brackets denote integrals over primitive basis functions and parentheses denote integrals over contracted basis functions.
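As a point of reference for the GPU mappings discussed below, Equation 13 amounts to four nested loops over primitives on an ordinary CPU. The sketch below is purely illustrative: the Shell structure is a simplified data layout, and primitiveERI() is a dummy stand-in for a real McMurchie-Davidson primitive integral routine, not our implementation.

```cuda
#include <cmath>

// Simplified contracted basis function: contraction length, coefficients,
// exponents, and center. Real codes store considerably more than this.
struct Shell {
    int    nPrim;       // contraction length N_mu (typically 1..8)
    double d[8];        // contraction coefficients d_mu,i
    double alpha[8];    // primitive exponents
    double R[3];        // nuclear center of the basis function
};

// Dummy stand-in for a primitive ERI [pq|rs]; a real implementation would
// evaluate the McMurchie-Davidson recurrences here.
double primitiveERI(const Shell &a, int p, const Shell &b, int q,
                    const Shell &c, int r, const Shell &d, int s)
{
    return std::exp(-(a.alpha[p] + b.alpha[q] + c.alpha[r] + d.alpha[s]));
}

// Equation 13: one contracted ERI (mu nu|lambda sigma) as a sum of
// coefficient-weighted primitive integrals.
double contractedERI(const Shell &mu, const Shell &nu,
                     const Shell &lam, const Shell &sig)
{
    double sum = 0.0;
    for (int p = 0; p < mu.nPrim; ++p)
        for (int q = 0; q < nu.nPrim; ++q)
            for (int r = 0; r < lam.nPrim; ++r)
                for (int s = 0; s < sig.nPrim; ++s)
                    sum += mu.d[p] * nu.d[q] * lam.d[r] * sig.d[s] *
                           primitiveERI(mu, p, nu, q, lam, r, sig, s);
    return sum;
}
```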

Several algorithms can evaluate the primitive integrals [pq|rs] for GTO basis sets. We won't discuss these in any detail here except to say that we've used the McMurchie-Davidson scheme,7 which requires relatively few intermediates per integral. The resulting low memory requirements for the kernels let us maximize SM occupancy; the operations involved in evaluating the primitive integrals also include evaluation of reciprocals, square roots, and exponentials, in addition to simple arithmetic operations such as addition and multiplication.

GPU Algorithms for ERI Evaluation

In other work, we've explored three different algorithms to evaluate the $O(K^4)$ ERIs over contracted basis functions and store them in the GPU memory.8 Here, we summarize the algorithms and comment on several aspects of their performance. As a test case, we consider a molecular system composed of 64 hydrogen atoms arranged on a 4 × 4 × 4 lattice. We use two basis sets: the first (denoted STO-6G) has six primitive functions for each contracted basis function with one contracted basis function per atom, and the second (denoted 6-311G) has three contracted functions per atom in combinations of three, one, and one primitive basis functions, respectively. These two basis sets represent highly contracted and relatively uncontracted basis sets, and they serve to show how the degree of contraction in the basis set affects algorithm performance.


For the hydrogen atom lattice test case, the number of contracted basis functions is 64 and 192 for the STO-6G and 6-311G basis sets, respectively, which leads to $O(10^6)$ and $O(10^8)$ ERIs over contracted basis functions.

As we can see in Equation 9, the ERIs have several permutation symmetries; for example, interchange of the first or last two indices in the (μν|λσ) ERI doesn't change the integral's value. Thus, we can represent the contracted ERIs as a square matrix of dimension K(K + 1)/2 × K(K + 1)/2, as Figure 2 shows; here, the rows and columns represent unique μν and λσ index pairs. Furthermore, we can interchange the first pair of indices with the last pair without changing the ERI value, that is, (μν|λσ) = (λσ|μν). This implies that the ERI matrix is symmetric, and only the ERIs on or above the main diagonal need to be calculated.
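As a quick check of the integral counts quoted above: with $M = K(K+1)/2$ unique index pairs, the number of contracted ERIs on or above the diagonal of this matrix is $M(M+1)/2$, so

$$
K = 64:\; M = 2080,\;\; \frac{M(M+1)}{2} \approx 2.2\times 10^{6}; \qquad
K = 192:\; M = 18{,}528,\;\; \frac{M(M+1)}{2} \approx 1.7\times 10^{8},
$$

consistent with the $O(10^6)$ and $O(10^8)$ figures given for the STO-6G and 6-311G basis sets.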

Figure 2 shows the primitive integrals contributing to each contracted integral as small squares (see the blow up labeled "primitive integrals"). We've simplified here to the case in which each contracted basis function is a linear combination of the same number of primitives. In realistic cases, each of the contracted basis functions can involve a different number of primitive basis functions.

This organization of the contracted ERIs immediately suggests three different mappings of the computational work to thread blocks. We could assign

• a thread to each contracted ERI (1T1CI, "1 Thread–1 Contracted Integral" in Figure 2);
• a thread block to each contracted ERI (1B1CI, "1 Block–1 Contracted Integral" in Figure 2); or
• a thread to each primitive ERI (1T1PI, "1 Thread–1 Primitive Integral" in Figure 2).

We’ve implemented all three of these schemes on the GPU; the grain of parallelism and the de-gree of load balancing differed in all three cases. The 1T1PI scheme is the most fine-grained and provides the largest number of threads for cal-culation, and the 1T1CI scheme is the least fine-grained, providing a larger amount of work for active threads.

In the 1T1CI scheme, each thread calculates its contracted integral by directly looping over all primitive ERIs and accumulating the results according to Equation 13. Once the primitive ERI evaluation and summation completes, the contracted integral is stored in the GPU memory.
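A stripped-down CUDA kernel in the spirit of the 1T1CI mapping is sketched below; the flat pair-list layout (PrimPair, start, count) and the primitivePairERI() routine are placeholder assumptions rather than our production data structures, and only the thread-to-integral assignment is meant to be taken literally.

```cuda
// 1T1CI sketch: one thread per contracted ERI. Each bra or ket index pair
// (mu nu) or (lambda sigma) owns a contiguous run of primitive pair data.
struct PrimPair {
    float coeff;      // product of contraction coefficients, e.g. d_mu,p * d_nu,q
    float data[5];    // packed primitive quantities (exponents, centers, ...)
};

__device__ float primitivePairERI(const PrimPair &bra, const PrimPair &ket)
{
    // Dummy body so the sketch compiles; a real kernel would evaluate the
    // McMurchie-Davidson primitive integral here.
    return bra.data[0] * ket.data[0];
}

__global__ void eri1T1CI(const PrimPair *pairs, const int *start,
                         const int *count, int nPairs, float *eri)
{
    int bra = blockIdx.y * blockDim.y + threadIdx.y;   // row:    (mu nu)
    int ket = blockIdx.x * blockDim.x + threadIdx.x;   // column: (lambda sigma)
    if (bra >= nPairs || ket >= nPairs || ket < bra)   // upper triangle only
        return;

    float sum = 0.0f;
    for (int i = 0; i < count[bra]; ++i)               // primitives of the bra pair
        for (int j = 0; j < count[ket]; ++j)           // primitives of the ket pair
            sum += pairs[start[bra] + i].coeff * pairs[start[ket] + j].coeff *
                   primitivePairERI(pairs[start[bra] + i], pairs[start[ket] + j]);

    eri[bra * nPairs + ket] = sum;                     // contracted (mu nu|lambda sigma)
}
```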

Figure 2. Schematic of three different mapping schemes for evaluating ERIs on the GPU. The large square represents the matrix of contracted integrals; small squares below the main diagonal (blue) represent integrals that don't need to be computed because the integral matrix is symmetric. Each of the contracted integrals is a sum over primitive integrals. The mapping schemes differ in how the computational work is apportioned: red squares superimposed on the integral matrix denote work done by a representative thread block, and the three blow ups show how the work is apportioned to threads within the thread block.


Neighboring integrals can have different numbers of contributing primitives and hence a different number of loop cycles. When the threads responsible for these neighboring integrals belong to the same warp, they execute in SIMD fashion, which produces load imbalance. We can minimize the impact by further organizing the integrals into subgrids according to the contraction length of the basis functions involved. In this case, all threads in a warp have similar workloads (thus minimizing load-balancing issues), but it requires further reorganization of the computation with both programming and runtime overhead; the latter, however, is usually small when compared to typical computation timings.

The 1B1CI mapping scheme is finer grained and maps each contracted integral to a whole thread block rather than a single thread. This organization avoids the load-imbalance issues inherent to the 1T1CI algorithm because distinct thread blocks never share common warps. Within a block, we have several ways to assign primitive integrals to GPU threads. We chose to assign them cyclically, with each successive term in the sum of Equation 13 mapped to a successive GPU thread; when the last thread is reached, the subsequent term is assigned to the first thread, and so on. After all threads compute their primitive integrals, the results are summed using the shared on-chip memory, and the final result is stored in GPU DRAM.
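The corresponding 1B1CI kernel replaces the per-thread loop with a cyclic assignment of primitive terms to the threads of one block, followed by a reduction in shared memory. Again, this is only a sketch using the same hypothetical PrimPair layout and primitivePairERI() routine as above.

```cuda
// 1B1CI sketch: one thread block per contracted ERI. The primitive terms of
// Equation 13 are dealt out cyclically to the block's threads, partial sums
// are reduced in shared memory, and thread 0 writes the contracted result.
// Assumes a 1D block of 64 threads (a power of two for the tree reduction).
__global__ void eri1B1CI(const PrimPair *pairs, const int *start,
                         const int *count, int nPairs, float *eri)
{
    __shared__ float partial[64];                     // one slot per thread

    int bra = blockIdx.y;                             // this block's (mu nu) pair
    int ket = blockIdx.x;                             // this block's (lambda sigma) pair
    if (ket < bra) return;                            // upper triangle only (whole block exits)

    int nBra = count[bra], nKet = count[ket];
    int nTerms = nBra * nKet;                         // primitive terms in Equation 13

    float sum = 0.0f;
    for (int t = threadIdx.x; t < nTerms; t += blockDim.x) {   // cyclic assignment
        const PrimPair &b = pairs[start[bra] + t / nKet];
        const PrimPair &k = pairs[start[ket] + t % nKet];
        sum += b.coeff * k.coeff * primitivePairERI(b, k);
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        eri[bra * nPairs + ket] = partial[0];
}
```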

Unfortunately, the 1B1CI scheme sometimes experiences load-balancing issues that are difficult to eliminate. Consider a contracted integral comprised of just one primitive integral; such a situation is possible when all the basis functions have unit contraction length (that is, they aren't contracted at all). In this case, only one thread in the whole block will have work assigned to it, but because the warps are processed in SIMD fashion, the other 31 threads in the warp will execute the same set of instructions and waste the computational time. Direct tests, performed on a system with a large number of weakly contracted integrals, confirm this prediction.

The 1T1PI mapping scheme exhibits the finest-grain level of parallelism of all the schemes presented. Unlike the two previous approaches, the target integral grid has primitive rather than contracted integrals, and each GPU thread calculates just one primitive integral, no matter which contracted integrals it contributes to. As soon as we calculate and store all the primitives on the GPU, another GPU kernel further transforms them to the final array of contracted integrals. The second step isn't required in the 1T1CI and 1B1CI algorithms because all required primitives are stored either in registers or shared memory and thus are easily assembled into a contracted integral. In contrast, in the 1T1PI scheme, the primitives constituting the same contracted integral can belong to different thread blocks running on different SMs. In this case, data exchange is possible only through the GPU DRAM, incurring hundreds of clock cycles of latency.

Table 1 shows benchmark results for the 64 hydrogen atom lattice. As mentioned earlier, we used two different basis sets to determine the contraction length's effect. For the weakly contracted basis set (6-311G), we found that 1B1CI mapping performs poorly (as predicted), mostly because of the large number of "empty" threads that still execute instructions due to the SIMD hardware model. For this case, we estimated that the 1B1CI algorithm possesses a 4.2X computational overhead, assuming each warp contains 32 threads. The 1T1PI mapping was the fastest, but two considerations are important here. First, the summation of Equation 13 doesn't produce much overhead because most of the contracted basis functions consist of a single primitive. Second, the GPU integral evaluation kernel is relatively simple and consumes a small number of registers, which allows more active threads to run on an SM and hence provides better instruction pipelining. For the highly contracted basis set (STO-6G), the situation is reversed: the 1T1PI algorithm is the slowest because of the summation of the primitive integrals, which is more likely to require communication across thread blocks. The 1B1CI scheme avoids this overhead and distributes the work more evenly (all contracted integrals require the same number of primitive ERIs because all basis functions have the same contraction length). We found that the 1T1CI algorithm represents a compromise that's less sensitive to the degree of basis set contraction. In both cases, it's either "almost the fastest" or simply "the fastest" and thus would be recommended for conventional SCF.

An additional issue is the time required to move the contracted integrals between the GPU and CPU main memory (in practice, the integrals rarely fit in the GPU DRAM). Table 1 shows that the time for this GPU–CPU transfer can exceed the integral evaluation time for weakly contracted basis sets. An alternate approach that avoids transferring the ERIs would clearly be advantageous.

By substituting Equation 13 into Equations 6 and 7, we can avoid the formation of the contracted ERIs completely. This is the usual strategy when we use direct SCF methods to re-evaluate ERIs in every step of the SCF procedure.


In this case, we avoid the Achilles' heel of the 1T1PI scheme (formation of the contracted ERIs), so it becomes the recommended scheme. We've implemented construction of the J and K matrices on the GPU concurrent with ERI evaluation via the 1T1PI scheme. This has the added advantage of avoiding CPU–GPU transfer of the ERIs; the J and K matrices contain only $O(K^2)$ elements, compared to the $O(K^4)$ ERIs. Due to limited space, we won't discuss the details of the algorithms here, but we will present some results that demonstrate the accuracy and performance achieved so far.
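To illustrate what substituting Equation 13 into Equation 6 buys, the following scalar sketch accumulates the Coulomb matrix directly from primitive contributions without ever storing the $O(K^4)$ contracted ERIs. It reuses the Shell structure and contractedERI() sketch introduced after Equation 13, and it is a naive CPU-style loop for exposition only, not the 1T1PI GPU organization we actually use; a practical direct SCF code would also exploit the permutational symmetry of the ERIs and screen out negligible contributions.

```cuda
// Exposition-only sketch: build J (Equation 6) directly from primitives by
// evaluating each contracted ERI on the fly, so no ERI array is ever stored.
// This is NOT our GPU algorithm; it ignores symmetry and screening.
void buildJ(const Shell *shells, int K, const double *P, double *J)
{
    for (int mu = 0; mu < K; ++mu)
        for (int nu = 0; nu < K; ++nu) {
            double Jmn = 0.0;
            for (int lam = 0; lam < K; ++lam)
                for (int sig = 0; sig < K; ++sig)
                    Jmn += P[lam * K + sig] *
                           contractedERI(shells[mu], shells[nu],
                                         shells[lam], shells[sig]);
            J[mu * K + nu] = Jmn;
        }
}
```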

As mentioned earlier, the basis functions used in quantum chemistry have an associated angular momentum, that is, the polynomial prefactor in Equations 10 and 11. In the hydrogen lattice test case, all basis functions were of s type, meaning no polynomial prefactor. For atoms heavier than hydrogen, it becomes essential to also include higher angular momentum functions. Treating these efficiently requires computing all components such as $p_x$, $p_y$, and $p_z$ simultaneously. In the context of the GPU, this means that we should write separate kernels for ERIs that have different angular momentum combinations. These kernels will involve more arithmetic operations as the angular momentum increases, simply because more ERIs are computed simultaneously. We've written such kernels for all ERIs in basis sets including s and p type basis functions.

To better quantify the GPU’s performance, we’ve investigated our algorithms for J matrix construction. The GPU’s peak performance is 350 Gflops, and we were curious to see how close our algorithms came to this theoretical limit. Table 2 shows performance results for a subset of the ker-nels we coded. For each kernel, we counted the cor-responding number of floating-point instructions it executed. We counted all the instructions as 1 Flop, excluding MAD, which we assumed to take 2 flops. We then hand-counted the resulting float-ing-point operations from the compiler-generated PTXAS file (an intermediate assembler-type code that the compiler transforms to actual machine in-structions). To evaluate the application’s DRAM

bandwidth, we also counted the number of mem-ory operations (Mops) each thread needed to ex-ecute to load data from the GPU main memory for integral batch evaluation. We counted each 32-bit load instruction as 1 Mop and the 64- and 128-bit load instructions as 2 and 4 Mops, correspond-ingly. Because we use texture memory (which can be cached), the resulting bandwidth is likely over-estimated. In our algorithm, we found that using texture memory was even more efficient than the textbook “global memory load → shared memory store → synchronization → broadcast through shared memory” scheme. This is due to synchroni-zation overhead, which hinders effective parallel-ization. We also determined the number of active threads (the hardware supports 768 at most) that are actually launched on every streaming multipro-cessor. Finally, we determined the number of reg-isters each active thread requires, which, in turn, determines GPU occupancy as discussed earlier.

As expected, the kernels involving higher angular momentum functions require more floating-point operations and registers, but the need for more registers per thread leads to fewer threads being active. Although sustained performance is less than 30 percent of the theoretical peak value, the GPU performance is still impressive compared to commodity CPUs; for example, a single AMD Opteron core demonstrates 1 to 3 Gflops in the Linpack benchmark. Given that a general quantum chemistry code is far less optimized than Linpack, we can estimate 1 Gflop as the upper bound for integral generation performance on this CPU. In contrast, we achieve 70 to 100 Gflops on the GPU.

Comparing the performance in Table 2 for the sspp and pppp kernels, we can see that the GPU performance grows with arithmetic complexity for a fixed number of memory accesses, which suggests that our application is memory-bound on the GPU. Furthermore, the total memory bandwidth observed (although sometimes overestimated due to texture caching) is close to the 80 Gbytes/s peak value (in practice, we usually got 40 to 70 Gbytes/s bandwidth in global memory reads).

Table 1. Two-electron integral evaluation of the 64 hydrogen atom lattice on a GPU using three algorithms (1B1CI, 1T1CI, and 1T1PI).

Basis set   GPU 1B1CI   GPU 1T1CI   GPU 1T1PI   GPU–CPU transfer*   GAMESS**
6-311G      7.086 s     0.675 s     0.428 s     0.883 s             170.8 s
STO-6G      1.608 s     1.099 s     2.863 s     0.012 s             90.6 s

*The amount of time required to copy the contracted integrals from the GPU to CPU memory.
**The same test case using the GAMESS program package on a single Opteron 175 CPU, for comparison.


To further verify the conclusion that our application is memory-bound, we performed a direct test. Out of the 48 to 84 bytes required to evaluate each batch, we left only the 24 bytes that were vitally important for the application to run and replaced the other quantities with constants. We evaluated the resulting performance and present it in parentheses in Table 2's "performance" column. Although the number of arithmetic operations was unchanged, the Gflops achieved increased by a factor of two or more, which clearly demonstrates our conclusion's correctness.

In anticipation of upcoming double-precision hardware, we're pleased that our algorithms are currently memory-bound. Although the memory bandwidth will effectively decrease by a factor of two (due to 64- instead of 32-bit number representation) when the next generation of GPUs uses double precision, the more dramatic decrease will come in arithmetic performance. However, we anticipate that the increased arithmetic cost won't much affect our algorithms; instead, they will only be roughly a factor of two slower in double precision.

Our code, which is still under development, successfully competes with modern, well-optimized, general-purpose quantum chemistry programs such as GAMESS.9 We performed benchmark tests on the following molecules using the 3-21G basis set: caffeine (C8N4H10O2), cholesterol (C27H46O), buckyball (C60), taxol (C45NH49O15), and valinomycin (C54N6H90O18). Figure 3 shows all these molecules, and Table 3 summarizes the benchmark results. The GPU is up to 93 times faster than a single 3.0-GHz Intel Pentium D CPU for these molecules.

The GPU we used for these tests supports only 32-bit arithmetic operations, meaning that we can only expect six or seven significant figures of accuracy in the final results. This might not always be sufficient for quantum chemistry applications, where "chemical accuracy" is typically considered to be $10^{-3}$ atomic units. As we can see by comparing the GPU and GAMESS electronic energies in Table 3, this level of accuracy isn't always achieved. Fortunately, the next generation of GPUs will provide hardware support for double precision, and we expect our algorithms will only be two times slower because they're currently limited by the GPU's memory bandwidth and don't saturate the GPU's floating-point capabilities.
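A rough back-of-the-envelope estimate (not a measured result) makes the size dependence explicit. Single precision carries a relative accuracy of roughly $10^{-7}$, so for a total energy of order $2\times10^{4}$ atomic units (valinomycin in Table 3), even a single rounding of the final result can contribute an absolute error of order

$$
2\times 10^{4}\ \mathrm{a.u.} \times 10^{-7} \approx 2\times 10^{-3}\ \mathrm{a.u.},
$$

which is already at the threshold of chemical accuracy before accumulation over the many operations of Fock matrix formation is even considered. This is consistent with the energy deviations in Table 3 growing with molecular size.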

I n this article, we’ve demonstrated that GPUs can significantly outpace commod-ity CPUs in the central bottleneck of most quantum chemistry problems—evaluation

of two-electron repulsion integrals and subse-quent Coulomb and exchange operator matrix for-mations. Speedups on the order of 100 times are readily achievable for chemical systems of practical interest, and the inherent high level of parallelism results in complete elimination of interblock com-munication during Fock matrix formation, mak-ing further parallelization over multiple GPUs an obvious step in the near future.

Table 2. Integral evaluation GPU kernel specifications and performance results.

Kernel   Floating-point operations   Memory operations   Registers per thread   Active threads per SM   Performance (Gflops)   Bandwidth (Gbytes/s)
ssss     30                          12                  20                     384                     88 (175)               131
sssp     55                          15                  24                     320                     70 (174)               71
sspp     84                          21                  24                     320                     69 (227)               64
pppp     387                         21                  56                     128                     97 (198)               20

Performance figures in parentheses are from the reduced-memory-traffic test described in the text, in which most input quantities were replaced with constants.

Table 3. Performance and accuracy of GPU algorithms for direct self-consistent field (SCF) benchmarks.

Molecule       GPU time per SCF iteration (s)   GAMESS time per SCF iteration (s)   GPU electronic energy, 32-bit (a.u.)   GAMESS electronic energy (a.u.)   Speedup
Caffeine       0.168                            4.4                                 -1605.91830                            -1605.91825                       26
Cholesterol    1.23                             68.0                                -3898.82158                            -3898.82189                       55
Buckyball      5.71                             332.0                               -10521.6414                            -10521.6491                       58
Taxol          4.45                             279.6                               -12560.6840                            -12560.6828                       63
Valinomycin    8.09                             750.6                               -20351.9855                            -20351.9904                       93


For very large molecules, the 32-bit precision provided by the Nvidia G80 series hardware isn't sufficient because the total energy grows with the molecule's size, whereas in chemical problems, relative energies are the primary objects of interest. In fact, using an incremental Fock matrix scheme to compute only the difference between Fock matrices in successive iterations and accumulating the Fock matrix on the CPU with double-precision accuracy can improve the final result's precision by up to a factor of 10. Nevertheless, we still require higher precision as the molecules being studied get larger. To maintain "chemical accuracy" for energy differences in 32-bit precision, we're limited in practice to molecules with fewer than 100 atoms.

Fortunately, Nvidia recently released the next generation of GPUs that supports 64-bit precision in hardware. Because 32-bit arithmetic will remain significantly faster than 64-bit arithmetic, we anticipate that a mixed-precision computational model will be ideal. In this case, the program will process a small fraction of ERIs (those with the largest absolute value) using 64-bit arithmetic and evaluate the vast majority of ERIs using the faster 32-bit arithmetic. Because the number of ERIs that require double-precision accuracy scales linearly with system size, the impact of double-precision calculations on overall computational performance should be low for large molecules.
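A mixed-precision dispatch of this kind can be as simple as a magnitude test per integral batch. The sketch below reuses the PrimPair layout and primitivePairERI() placeholder from the earlier kernels; eriBound(), eriFloat(), and eriDouble() are hypothetical stand-ins (for example, a Schwarz-type upper bound and the two precision paths), and this is not the scheme implemented in our code.

```cuda
// Hypothetical mixed-precision dispatch for next-generation hardware:
// batches whose estimated magnitude exceeds a threshold take the 64-bit
// path, the rest the faster 32-bit path. All three routines are dummies.
__device__ float eriBound(const PrimPair &bra, const PrimPair &ket)
{
    return fabsf(bra.coeff * ket.coeff);    // stand-in for a rigorous upper bound
}

__device__ float eriFloat(const PrimPair &bra, const PrimPair &ket)
{
    return primitivePairERI(bra, ket);      // existing 32-bit sketch
}

__device__ double eriDouble(const PrimPair &bra, const PrimPair &ket)
{
    return (double)primitivePairERI(bra, ket);   // placeholder 64-bit path
}

__device__ double eriMixed(const PrimPair &bra, const PrimPair &ket,
                           double threshold)
{
    if (eriBound(bra, ket) > threshold)
        return eriDouble(bra, ket);         // the few largest integrals: full precision
    return (double)eriFloat(bra, ket);      // the vast majority: single precision
}
```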

The computational methods presented here can be easily augmented to allow calculations within the framework of DFT, which is known to be significantly more accurate than HF theory. We're currently implementing a general-purpose electronic structure code including DFT that runs almost entirely on the GPU in anticipation of upcoming hardware advances. There is good reason to believe that these advances will enable the calculation of structures for small proteins directly from quantum mechanics, as well as computational design of new small-molecule drugs targeted to specific proteins, with unprecedented accuracy and speed.

References

1. J. Bolz et al., "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM Trans. Graph., vol. 22, no. 3, 2003, p. 917.
2. J. Hall, N. Carr, and J. Hart, GPU Algorithms for Radiosity and Subsurface Scattering, tech. report UIUCDCS-R-2003-2328, Univ. of Illinois, Urbana-Champaign, 2003.
3. K. Fatahalian, J. Sugerman, and P. Hanrahan, Graphics Hardware, T. Akenine-Moller and M. McCool, eds., Wellesley, 2004, p. 133.
4. A.G. Anderson, W.A. Goddard III, and P. Schroder, "Quantum Monte Carlo on Graphical Processing Units," Computer Physics Comm., vol. 177, no. 3, 2007, p. 298.
5. A. Szabo and N.S. Ostlund, Modern Quantum Chemistry, Dover, 1996.
6. R.G. Parr and W. Yang, Density-Functional Theory of Atoms and Molecules, Oxford, 1989.
7. L.E. McMurchie and E.R. Davidson, "One- and Two-Electron Integrals Over Cartesian Gaussian Functions," J. Computational Physics, vol. 26, no. 2, 1978, p. 218.
8. I.S. Ufimtsev and T.J. Martínez, "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation," J. Chemical Theory and Computation, vol. 4, no. 2, 2008, p. 222.
9. M.W. Schmidt et al., "General Atomic and Molecular Electronic Structure System," J. Computational Chemistry, vol. 14, no. 11, 1993, p. 1347.

Ivan S. Ufimtsev is a graduate student and research assistant in the chemistry department at the University of Illinois. His research interests include leveraging non-traditional architectures for scientific computing. Contact him at [email protected].

Todd J. Martínez is the Gutgsell Chair of Chemistry at the University of Illinois. His research interests center on understanding the interplay between electronic and nuclear motion in molecules, especially in the context of chemical reactions initiated by light. Martínez became interested in computer architectures and videogame design at an early age, writing and selling his first game programs (coded in assembler for the 6502 processor) in the early 1980s. Contact him at [email protected].


Figure 3. Molecules used to test GPU performance. The set of molecules used spans the size range from 20 to 256 atoms.
