
[Figure: Timings for n × n matrix multiply, conventional BLAS DGEMM vs. CUBLAS. Time (seconds) vs. matrix size n; curves: Conventional, CUDA paged, CUDA pinned. Asymptotic speeds: conventional 2.9 Gflop/s; CUDA with paged memory 86.9 Gflop/s; CUDA with pinned memory 60.0 Gflop/s (assuming the pinned memory is reused for ten calls). Crossing points: n = 1845, t = 4.5 s and n = 798, t = 0.40 s.]

[Figure: Matrix sizes (n × m) occurring in RASSCF DGEMM calls, with points binned by number of calls (1-10, 10-100, 100-1000, 1000-10000, and further decades up to 10000000-100000000). Both axes run from 1 to 100000.]

[Figure: Timings for parallelization over n nodes. Speed-up vs. number of nodes (1-32); curves: SEWARD, Cho-SEWARD, CASSCF, Cho-CASSCF, CASPT2, Cho-CASPT2, RASPT2.]

[Figure: When is CUDA worthwhile? Accumulated time (% of total) vs. time spent in each DGEMM call (s). 100% = 2090 seconds = total time for 109666421 calls; 20% = total time for the 1088 calls that were large enough to absorb the overhead of CUDA calls; 80% = total time for the 109665333 calls that were shorter than 0.02 seconds, i.e., too short for CUDA.]

How does hardware development influence quantum chemistry?
Mickaël Delcey, Per-Åke Malmqvist, Steven Vancoillie, Valera Veryazov
Theoretical Chemistry, Chemical Center, P.O.B. 124, Lund 22100, Sweden. E-mail: [email protected]

Sections: Parallel computing, Multicores, Multinodes, SSD, Graphical Processing Units (GPUs), CuBLAS, Conclusions, References


References
[1] W. A. de Jong, E. Bylaska, N. Govind, C. L. Janssen, K. Kowalski, T. Müller, I. M. B. Nielsen, H. J. J. van Dam, V. Veryazov and R. Lindh, Utilizing High Performance Computing for Chemistry: Parallel Computational Chemistry, Phys. Chem. Chem. Phys. 12, 6896-6920 (2010).
[2] F. Aquilante, L. De Vico, N. Ferré, G. Ghigo, P.-Å. Malmqvist, P. Neogrády, T. B. Pedersen, M. Pitoňák, M. Reiher, B. O. Roos, L. Serrano-Andrés, M. Urban, V. Veryazov and R. Lindh, MOLCAS 7: The Next Generation, J. Comput. Chem. 31, 224-247 (2010).

Hardware development is not driven by the needs of computational chemistry. At the same time, new technologies quite often cannot be used directly and require modifications of the application software. Substantial time can be needed to adjust computational codes to take advantage of new computer architectures.

A modern GPU is a highly parallel, multi-threaded processor with very high floating-point performance and internal memory bandwidth. Nvidia has made these features usable for non-graphical tasks by creating an environment that includes code development tools and the high-performance libraries CuBLAS and CuFFT. Quantum chemistry software is typically rich in BLAS calls. This suggests a simple approach to GPU acceleration of quantum chemistry codes: can we simply use a BLAS library that implements the BLAS calls by performing the corresponding CuBLAS ones? The idea is simple enough to be tested at once: we ran the MOLCAS code with the only modification that the matrix multiplication routine was replaced by a wrapper to CuBLAS_DGEMM. Test calculations were done on a Tesla GPU provided by STFC Daresbury Laboratory. However, the transfer of data between the GPU card and the CPU goes via temporarily allocated memory and takes time. It is fastest if a portion of memory can be 'pinned' and reused for several calls.
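As an illustration of what such a drop-in replacement can look like, here is a minimal C++ sketch of a Fortran-callable DGEMM wrapper built on the cuBLAS and CUDA runtime APIs. It is not the actual wrapper used in these tests: the dgemm_ entry point, the buffer management and the omission of error checking are simplifying assumptions of this sketch, but it shows the pattern described above, with a pinned host staging buffer and device buffers allocated once and reused by later calls.

```cpp
// Minimal sketch of a Fortran-callable DGEMM wrapper forwarding to cuBLAS.
// NOT the actual wrapper used in these tests: buffer handling, the dgemm_
// entry point and the absence of error checking are simplifying assumptions.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

namespace {
cublasHandle_t handle = nullptr;
double *h_pinned = nullptr;                      // pinned host staging buffer, reused across calls
double *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;   // device buffers, reused across calls
size_t  cap = 0;                                 // current capacity (elements) of each buffer

// Grow the reusable buffers only when a larger call arrives.
void ensure_capacity(size_t need) {
    if (need <= cap) return;
    cudaFreeHost(h_pinned);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cudaMallocHost((void**)&h_pinned, need * sizeof(double));
    cudaMalloc((void**)&d_A, need * sizeof(double));
    cudaMalloc((void**)&d_B, need * sizeof(double));
    cudaMalloc((void**)&d_C, need * sizeof(double));
    cap = need;
}

// Stage a column-major block (ld x cols, ld includes padding) through the
// pinned buffer to the device; pinned memory makes the PCIe transfer fast.
void to_device(const double *src, double *dst, int cols, int ld) {
    size_t bytes = size_t(ld) * cols * sizeof(double);
    std::memcpy(h_pinned, src, bytes);
    cudaMemcpy(dst, h_pinned, bytes, cudaMemcpyHostToDevice);
}
} // namespace

// Fortran BLAS entry point: column-major matrices, arguments passed by reference.
extern "C" void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *A, const int *lda,
                       const double *B, const int *ldb,
                       const double *beta, double *C, const int *ldc)
{
    if (!handle) cublasCreate(&handle);          // one-time setup, reused by later calls

    cublasOperation_t opA = (*transa == 'N' || *transa == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    cublasOperation_t opB = (*transb == 'N' || *transb == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    int ka = (opA == CUBLAS_OP_N) ? *k : *m;     // number of stored columns of A
    int kb = (opB == CUBLAS_OP_N) ? *n : *k;     // number of stored columns of B

    ensure_capacity(std::max({size_t(*lda) * ka, size_t(*ldb) * kb, size_t(*ldc) * *n}));

    to_device(A, d_A, ka, *lda);
    to_device(B, d_B, kb, *ldb);
    if (*beta != 0.0) to_device(C, d_C, *n, *ldc);   // C is an input only when beta != 0

    cublasDgemm(handle, opA, opB, *m, *n, *k,
                alpha, d_A, *lda, d_B, *ldb, beta, d_C, *ldc);

    // Copy the result back, again staging through the pinned buffer.
    size_t bytes = size_t(*ldc) * *n * sizeof(double);
    cudaMemcpy(h_pinned, d_C, bytes, cudaMemcpyDeviceToHost);
    std::memcpy(C, h_pinned, bytes);
}
```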

Processor capacities have increased exponentially, following the famous Moore's law. The development of computational units has now taken a new direction: instead of increasing the number of transistors on a single chip, various parallelization techniques are used: communication between CPUs via the network, via the bus on the same mainboard, and finally by combining units as computational cores on the same CPU.

On the right are shown the results of parallelization over up to 24 nodes. The tests were made on the LUNARC cluster with a relatively slow Gigabit interconnect. The numerical gradient shows perfect scaling until the number of nodes exceeds the number of displacements. Computation and usage of integrals needs little data transfer and thus also shows almost perfect scaling. The result of RASPT2 for a polyyne system is also shown, as representative of small molecules with a big active space. The scaling is clearly better for this system, since so far only the generation of the many-particle density matrices is parallelized in the RASPT2/CASPT2 code. Leading CPU manufacturers claim that in the near future CPUs with 96 and more cores will be available. Quantum chemical codes should be adapted in order to use these technologies efficiently.

Solid State Drives and the cheap memory cards used, e.g., in cameras or as USB memory sticks have quite different read and write timings compared to conventional disks. A test MOLCAS job running on an external memory drive (connected via the USB 2.0 interface) showed a surprising result: RASSCF ran slightly faster, and CASPT2 was only 17% slower, than with a conventional HDD.

These hardware implementations differ in the data transfer speed between computational units, in the transfer speed between CPUs and memory, and in the cache size per CPU, so different computational algorithms benefit differently from them. The MOLCAS code [2] uses three kinds of parallelization: independent calculations (Numerical Gradients), computation of parts of the integrals (SEWARD plus the wave-function calculation), and fine-grained parallelization (multi-particle density matrices in CASSCF and CASPT2). Except for the last kind, the amount of data transferred between CPUs is fairly small. We selected the benzene dimer as a test case, with 528 basis functions, which corresponds to an average MOLCAS calculation. Differences in the scaling of the different modules for parallelization over cores are shown on the left. Clearly, parallelization over cores is not efficient, although the modules behave differently: Numerical Gradients and SEWARD see a small improvement from using more cores on the same CPU, while the CASSCF and CASPT2 codes are even slower in parallel. The situation is better with the Cholesky decomposition technique, where I/O is less important.
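To make the first of these three patterns concrete, the sketch below shows, in schematic C++/MPI form, how independent displacement calculations can be distributed over ranks with a single reduction at the end. It is a generic illustration under assumed names, not MOLCAS code: run_displaced_energy() and the displacement count are hypothetical placeholders.

```cpp
// Schematic C++/MPI sketch of the "independent calculations" pattern
// (numerical gradients): displacements are handed out round-robin to the
// ranks and only one small reduction needs communication at the end.
// NOT MOLCAS code; run_displaced_energy() and the counts are hypothetical.
#include <mpi.h>
#include <vector>

// Hypothetical stand-in for a full energy calculation at one displaced geometry.
double run_displaced_energy(int displacement) {
    return -231.0 - 1.0e-4 * displacement;       // dummy value, illustration only
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n_disp = 72;                       // hypothetical: 2 (+/-) x 3 (x,y,z) x 12 atoms
    std::vector<double> local(n_disp, 0.0), energies(n_disp, 0.0);

    // Each displacement is an independent calculation, so scaling stays near
    // perfect as long as n_disp >= nprocs (cf. the node-scaling results).
    for (int d = rank; d < n_disp; d += nprocs)
        local[d] = run_displaced_energy(d);

    // The only communication: collect all energies on rank 0 (a few hundred doubles).
    MPI_Reduce(local.data(), energies.data(), n_disp, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    // Rank 0 would now build the finite-difference gradient from `energies`.
    MPI_Finalize();
    return 0;
}
```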

Using CuBLAS_DGEMM is clearly faster for huge matrices, and the RASSCF and CASPT2 codes spend up to 80% of their time in DGEMM. But do we have such large matrices in a typical calculation? Polyyne was used as a representative RASSCF calculation, with 234 basis functions and a 4/8/4 active space. The statistics of the DGEMM calls for this test case show that the percentage of big matrices is very small. Moreover, these matrices are asymmetric, while CUBLAS_DGEMM is most efficient for square matrices. Efficiently batched algorithms such as the one in MOLCAS therefore gain less from GPU technology. An increase in the connection speed between CPU and GPU might change this conclusion completely.
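One way to act on this observation, sketched below under assumed names and a deliberately crude cost model, is a dispatching wrapper: estimate the conventional DGEMM time from the measured 2.9 Gflop/s rate and hand the call to the GPU path only when that estimate exceeds the break-even time suggested by the timing figure (about 0.4 s at the earlier crossing point). Here dgemm_cpu_ and dgemm_gpu_ are hypothetical symbols for the host BLAS routine and a cuBLAS wrapper such as the one sketched earlier; a production version would calibrate the threshold on the target hardware.

```cpp
// Sketch of a size-aware dispatcher: send a DGEMM call to the GPU only when
// the estimated conventional-BLAS time exceeds a break-even threshold.
// The 2.9 Gflop/s rate and the ~0.4 s crossing point are read off the timing
// figure; dgemm_cpu_ and dgemm_gpu_ are hypothetical names for the host BLAS
// routine and a cuBLAS wrapper such as the one sketched earlier.
extern "C" void dgemm_cpu_(const char*, const char*, const int*, const int*, const int*,
                           const double*, const double*, const int*,
                           const double*, const int*, const double*, double*, const int*);
extern "C" void dgemm_gpu_(const char*, const char*, const int*, const int*, const int*,
                           const double*, const double*, const int*,
                           const double*, const int*, const double*, double*, const int*);

static bool gpu_is_worthwhile(int m, int n, int k) {
    const double flops    = 2.0 * double(m) * double(n) * double(k);   // flop count of one DGEMM
    const double cpu_rate = 2.9e9;   // asymptotic conventional DGEMM speed (timing figure)
    const double t_break  = 0.40;    // break-even time: earlier crossing point in the figure
    return flops / cpu_rate > t_break;
}

extern "C" void dgemm_(const char *ta, const char *tb, const int *m, const int *n, const int *k,
                       const double *alpha, const double *A, const int *lda,
                       const double *B, const int *ldb,
                       const double *beta, double *C, const int *ldc)
{
    if (gpu_is_worthwhile(*m, *n, *k))
        dgemm_gpu_(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    else
        dgemm_cpu_(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
```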

[Figure: Parallelization over n cores. Speed-up vs. number of cores (1-4); curves: SEWARD, Cho-SEWARD, Num. Grad, CASSCF, Cho-CASSCF, CASPT2, Cho-CASPT2.]