phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems
Filippo Spiga and Ivan Girotto
Irish Centre for High-End Computing (ICHEC), Dublin, Ireland
Special Session on GPU Computing - PDP2012, Garching (DE), Feb 16, 2012
Outline
• Motivations
• Starting point: compute using libraries
• The CPU-GPU PHIGEMM library
• Serial GPU-accelerated PWSCF
• Multiple GPUs 3D-FFT case-study
• What happened next
Motivations & goals
As part of the PRACE 1-IP project, WP 7.5:
• Application porting to GPU
– through CUDA kernels
– through CUDA libraries
– through directives
• Evaluating new models/new paradigms
Main goals:
• Couple CUDA libraries and CPU libraries
• Create a library to transparently accelerate (part of) a complex application
• Hit, assess, and discuss the bandwidth effects of CPU+GPU BLAS and FFT
→ PRIMARY GOAL: going to production!
The CPU-GPU phiGEMM library/1: functionalities
Starting points:
• GPU & CPU libraries → do not re-invent the wheel!
• NVIDIA GPU HPL (by M. Fatica)
The phiGEMM library…
• independent open-source project
• targets all GEMM operations (S, C, D, Z)
• C and FORTRAN interfaces
• CUDA 3.x and CUDA 4.x compatible
• performs call-by-call profiling
• supports multiple GPUs (CUDA 4.x)
• current version 1.6
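As a rough illustration of how a drop-in CPU+GPU GEMM of this kind is typically called from C (the header and function name below are hypothetical placeholders, not necessarily the exact phiGEMM interface):

/* Hypothetical usage sketch: header and function names are placeholders,
 * not necessarily the exact phiGEMM API. The call mirrors a standard
 * column-major DGEMM: C = alpha*A*B + beta*C, split between CPU and GPU. */
#include <stdio.h>
#include <stdlib.h>
#include "phigemm.h"                      /* assumed public header */

int main(void)
{
    int n = 4096;
    double alpha = 1.0, beta = 0.0;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    for (long i = 0; i < (long)n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* Hypothetical drop-in replacement for dgemm; the library decides
     * how much of the product is computed on the GPU(s). */
    phigemm_dgemm("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    printf("C[0][0] = %g\n", C[0]);       /* expected: 2.0 * n */
    free(A); free(B); free(C);
    return 0;
}

The point of the C and FORTRAN interfaces is exactly this: the call site keeps the standard GEMM argument list, and the CPU/GPU split happens inside the library.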
The CPU-GPU phiGEMM library/2: the idea behind the scenes
[Diagram: execution timelines of a CPU+GPU GEMM. In the naive case the GPU work is bracketed by H2D and D2H transfers and the CPU and GPU finish at different times (unbalance); with a split factor S the work is divided so that both finish together.]
The CPU-GPU phiGEMM library/3: splitting on both sides
[Diagram: the two splitting schemes for C = A × B + C]
• Full A, partial B and partial C on the GPU
• Partial A, full B and partial C on the GPU
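A minimal sketch of the first scheme (full A plus a block of B and C columns on the GPU, the remaining columns on the CPU), written against the standard cuBLAS and reference BLAS interfaces; the fixed split factor, synchronous transfers, and missing error handling are simplifications, not how phiGEMM itself is implemented:

/* Sketch: split C = alpha*A*B + beta*C by columns of B and C.
 * The GPU gets the full A plus the first n_gpu columns of B and C;
 * the CPU BLAS handles the remaining n - n_gpu columns. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Reference (Fortran) BLAS prototype for the CPU part. */
extern void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
                   const int *k, const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb, const double *beta,
                   double *c, const int *ldc);

void split_dgemm(cublasHandle_t h, int m, int n, int k, double alpha,
                 const double *A, const double *B, double beta, double *C,
                 double split /* fraction of columns sent to the GPU, 0..1 */)
{
    int n_gpu = (int)(split * n);      /* columns computed on the GPU */
    int n_cpu = n - n_gpu;             /* columns computed on the CPU */

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, (size_t)m * k     * sizeof *dA);
    cudaMalloc((void **)&dB, (size_t)k * n_gpu * sizeof *dB);
    cudaMalloc((void **)&dC, (size_t)m * n_gpu * sizeof *dC);

    /* H2D: full A, first n_gpu columns of B and C (column-major => contiguous) */
    cudaMemcpy(dA, A, (size_t)m * k     * sizeof *dA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, (size_t)k * n_gpu * sizeof *dB, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, (size_t)m * n_gpu * sizeof *dC, cudaMemcpyHostToDevice);

    /* GPU part (asynchronous with respect to the host) */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    /* CPU part runs concurrently on the remaining columns */
    if (n_cpu > 0)
        dgemm_("N", "N", &m, &n_cpu, &k, &alpha, A, &m,
               B + (size_t)k * n_gpu, &k, &beta, C + (size_t)m * n_gpu, &m);

    /* D2H: bring the GPU columns of C back */
    cudaMemcpy(C, dC, (size_t)m * n_gpu * sizeof *dC, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

Because cublasDgemm returns control to the host immediately, the CPU portion overlaps with the GPU portion; the final cudaMemcpy acts as the synchronization point.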
What if they do not fit in GPU memory?
The CPU-GPU phiGEMM library/4: recursion for big matrices
[Diagram: when the operands of C = A × B + C do not fit in device memory, the problem is split into sub-problems A1 × B + C1 and A2 × B + C2, each handled with the CPU/GPU split above]
Recursive splitting → load-balancing penalty
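A sketch of how such a recursion might look, assuming the split_dgemm helper from the previous sketch and a hypothetical fits_gpu_memory check; note that the figure above splits A and C, while this sketch recurses over the columns of B and C only because that keeps the sub-blocks contiguous in column-major storage:

/* Sketch: if the three operands do not fit in device memory, split the
 * columns of B and C in half and recurse until each sub-problem fits. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void split_dgemm(cublasHandle_t h, int m, int n, int k, double alpha,
                 const double *A, const double *B, double beta, double *C,
                 double split);                      /* see previous sketch */

static int fits_gpu_memory(int m, int n, int k)
{
    size_t need = ((size_t)m * k + (size_t)k * n + (size_t)m * n) * sizeof(double);
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);               /* query device memory */
    return need < free_b / 2;                        /* keep a safety margin */
}

void recursive_dgemm(cublasHandle_t h, int m, int n, int k, double alpha,
                     const double *A, const double *B, double beta, double *C,
                     double split)
{
    if (fits_gpu_memory(m, n, k)) {
        split_dgemm(h, m, n, k, alpha, A, B, beta, C, split);
        return;
    }
    int n1 = n / 2, n2 = n - n1;
    /* Column-major: the second halves of B and C start at column n1. */
    recursive_dgemm(h, m, n1, k, alpha, A, B, beta, C, split);
    recursive_dgemm(h, m, n2, k, alpha, A, B + (size_t)k * n1,
                    beta, C + (size_t)m * n1, split);
}

Every extra recursion level adds a CPU/GPU synchronization point, which is one source of the load-balancing penalty mentioned above.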
The CPU-GPU phiGEMM library/5: the splitting issue
It is all about load-balancing…
• try a guess → not a good idea
• if you know the CPU and GPU peak performance → static approach
• through internal timing you can adjust the CPU/GPU workload dynamically until a specific tolerance is achieved → dynamic approach (a minimal sketch follows below)
It is all about effective bandwidth…
• even if you want to, you cannot guess it
• it varies considerably from system to system
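A minimal sketch of the dynamic approach, assuming the CPU and GPU portions of each call can be timed separately; the update rule and the names are illustrative, not necessarily what phiGEMM implements:

/* Sketch: adjust the GPU share of the work after each call so that the
 * measured CPU and GPU times converge within a tolerance. */
#include <math.h>

double update_split(double split, double t_cpu, double t_gpu, double tol)
{
    /* If GPU and CPU times already match within tolerance, keep the split. */
    if (fabs(t_gpu - t_cpu) / fmax(t_gpu, t_cpu) < tol)
        return split;

    /* Rebalance: give each side a share proportional to its measured speed.
     * speed ~ work / time; the GPU did 'split' of the work, the CPU '1-split'. */
    double v_gpu = split / t_gpu;
    double v_cpu = (1.0 - split) / t_cpu;
    double new_split = v_gpu / (v_gpu + v_cpu);

    /* Clamp to avoid degenerate splits. */
    if (new_split < 0.05) new_split = 0.05;
    if (new_split > 0.95) new_split = 0.95;
    return new_split;
}

After each GEMM call the new split replaces the old one, so the workload converges toward balanced CPU and GPU times regardless of the actual effective bandwidth.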
The CPU-GPU phiGEMM library/6: benchmark environment
High-end workstation
• 2 six-core Intel(R) Xeon(R) X5650 @ 2.67 GHz
• 24 GBytes of HOST memory
• 2 Tesla C2050, each with 3 GBytes of DEVICE memory
During the tests…
• We used both one socket (6 cores) and two sockets (12 cores)
• GPUs are on independent PCI channels
• We "manually" reduced the GPU memory available (one possible mechanism is sketched below)
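One possible way to "manually" shrink the device memory seen by the library is to pre-allocate a resident ballast buffer for the whole run; this is only an assumption about the mechanism, sketched below:

/* Sketch: emulate a smaller device memory by pre-allocating a dummy buffer
 * that stays resident for the whole run. The mechanism is an assumption. */
#include <cuda_runtime.h>
#include <stdio.h>

static void *g_ballast = NULL;   /* intentionally never freed during the run */

void limit_gpu_memory(double keep_fraction)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    size_t keep    = (size_t)(keep_fraction * (double)free_b);
    size_t ballast = free_b - keep;    /* amount to lock away */

    if (cudaMalloc(&g_ballast, ballast) != cudaSuccess)
        fprintf(stderr, "could not reserve %zu bytes of ballast\n", ballast);
}

Calling limit_gpu_memory(0.33) before the benchmark would leave roughly 1 GByte free on a 3 GByte C2050, matching the ZGEMM test case below.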
The CPU-GPU phiGEMM library/6: DGEMM, 12 OMP threads, full GPU memory
[Figure: phiGEMM single- and multi-GPU DGEMM performance, GFlops vs matrix dimension (M = N = K from 1024 to 10240), comparing CUBLAS thunking, CUBLAS peak, MKL, phiGEMM (1 GPU), and phiGEMM (2 GPU)]
Results of July 2011
The CPU-GPU phiGEMM library/7: ZGEMM, 6 OMP threads, 33% GPU memory (~1 GByte)
[Figure: phiGEMM single- and multi-GPU ZGEMM performance, GFlops vs matrix dimension (M = N = K from 1024 to 10240), comparing CUBLAS thunking, CUBLAS peak, MKL, phiGEMM (1 GPU), and phiGEMM (2 GPU)]
BONUS
Results of Jan 2012
The CPU-GPU phiGEMM library/8: DGEMM, m = n = k = 25k, 12 OMP threads
Note that…
• Is there any CPU BLAS library faster than MKL? Just plug it in
• Is there any GPU BLAS library faster than CUBLAS? Just plug it in
[Figure: phiGEMM performance for large matrices, GFlops vs number of GPUs (1, 2, 4), with MKL and CUBLAS as references]
Scaling is NOT linear (as expected)
Results of July 2011
Quantum ESPRESSO
• Quantum ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale.
– Density-Functional Theory, plane waves, pseudo-potential, …
• Very complex set of codes with shared modules
– ~500K lines of code (FORTRAN and C)
• Parallel code that runs on several HPC architectures
– Linux clusters, CRAY supercomputers, IBM Power & BlueGene, NEC, …
• It supports MPI and OpenMP
• Two main packages: PWscf and CP
Serial GPU-accelerated PWSCF/1
(CUDA vs single-core speedups of the individual ported kernels: ~5x, ~20x, ~7x, ~9x)
• It extensively uses libraries
– BLAS → phiGEMM
– LAPACK
– FFTW (mainly in vloc_psi)
• It already supports OpenMP well
• Wall-time is mainly concentrated in a few routines
– addusdens
– newd
– vloc_psi
• AUSURF112 is a “fair enough” starting point
Serial GPU-accelerated PWSCF/2: a deep look inside - where the libraries are
[Diagram: the PWscf self-consistency cycle. Starting from the setup of the initial wavefunction and physical quantities, each iteration diagonalizes the Hamiltonian matrix for each k-point (3D-FFT + GEMM + LAPACK), calculates the charge density (3D-FFT), and calculates the new non-local potentials (3D-FFT + GEMM) until self-consistency is achieved; the new forces and stresses are then calculated and, during a structure optimization, new atomic positions (and lattice parameters) are computed and the cycle restarts, otherwise the wavefunctions are recomputed and ortho-normalized.]
Serial GPU-accelerated PWSCF/3: benchmarks - AUSURF112, serial
[Figure: AUSURF112 serial benchmark, speedups of the GPU-accelerated PWscf over the CPU-only run: 3.49x, 3.54x, 5.54x, 7.81x]
BONUS
Results of July 2011
The VLOC_PSI/3D-FFT issue
• The 3D-FFT has been recognized as a potential bottleneck for large calculations in all scientific DFT codes
• BLAS and FFT have two different compute footprints (see the paper)
Q: is it worth allowing VLOC_PSI to exploit multiple GPUs?
Q: how much do we have to compute to compensate for the data transfer?
In the parallel implementation VLOC_PSI acts on distributed data, so the 3D-FFT grid is distributed (compulsory CPU↔GPU data transfers are expected)
→ completely different strategy (GTC2012!)
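For the batched 3D-FFT case study on the next slide, the per-GPU work can be expressed with cuFFT batched plans. Below is a minimal sketch that distributes a batch of identical complex-to-complex transforms round-robin over the available devices; the sequential per-device loop and the explicit per-chunk host transfers are deliberate simplifications, not the strategy used in the actual code:

/* Sketch: distribute a batch of identical 3D complex-to-complex FFTs over
 * the available GPUs using cuFFT batched plans. Host<->device transfers are
 * included because the FFT grid lives in host memory, as noted above. */
#include <cuda_runtime.h>
#include <cufft.h>

void batched_fft_multi_gpu(cufftDoubleComplex *h_data, int nx, int ny, int nz,
                           int batch)
{
    int n_gpu = 0;
    cudaGetDeviceCount(&n_gpu);

    size_t grid = (size_t)nx * ny * nz;          /* points per 3D transform */
    int per_gpu = (batch + n_gpu - 1) / n_gpu;   /* transforms per device */

    for (int dev = 0; dev < n_gpu; ++dev) {
        int first = dev * per_gpu;
        int count = (first + per_gpu <= batch) ? per_gpu : batch - first;
        if (count <= 0) break;

        cudaSetDevice(dev);

        cufftDoubleComplex *d_data;
        cudaMalloc((void **)&d_data, count * grid * sizeof *d_data);
        cudaMemcpy(d_data, h_data + (size_t)first * grid,
                   count * grid * sizeof *d_data, cudaMemcpyHostToDevice);

        cufftHandle plan;
        int n[3] = { nx, ny, nz };
        cufftPlanMany(&plan, 3, n, NULL, 1, (int)grid, NULL, 1, (int)grid,
                      CUFFT_Z2Z, count);
        cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);

        cudaMemcpy(h_data + (size_t)first * grid, d_data,
                   count * grid * sizeof *d_data, cudaMemcpyDeviceToHost);

        cufftDestroy(plan);
        cudaFree(d_data);
    }
}

For the first benchmark point below this would be called as batched_fft_multi_gpu(data, 64, 64, 64, 4096).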
Multiple-GPU batch 3D-FFT case study: performance evaluation
[Figures: batched 3D-FFT time [s] on 1, 2, and 4 GPUs for batches of 4096 × 64³, 512 × 128³, and 64 × 256³ transforms, and the same timings on a log scale compared against single-core FFTW]
Conclusions
• We developed the phiGEMM library, a reliable library that handles CPU-GPU GEMM computations by "reusing" other libraries
– http://qe-forge.org/projects/phigemm
• We developed a serial GPU-accelerated version of PWscf
– http://qe-forge.org/projects/q-e (PRACE branch)
• We assessed the performance and bottlenecks of phiGEMM by running a complex application on real benchmarks (PWscf)
• We hit new limits and challenges
What happened / is happening next
• Refinement of the splitting strategy and dynamic split factor
• Faced the phiGEMM worst-case scenario
– very long k dimension → a totally different splitting strategy
• Multi-GPU CUDA kernels for PWSCF serial calculation only
– yes, but only where it is really worthwhile (BLAS: yes; FFT: it depends)
• Full Parallel GPU-accelerated PWSCF (MPI+OpenMP+CUDA)
– transform distributed computations into local computations in order to use the GPU