Parallel implementation of the TDSE on a Graphics Processing Unit (GPU) platform
Cathal Ó Broin, Dublin City University
Together with Lampros Nikolopoulos, in collaboration with Ken Taylor
GPU Evolution
Compilers for GPUs are now available in higher-level languages (C and Fortran).
GPUs focus on parallelism.
Compared to CPUs, GPUs have:
● fewer control units
● more processing elements (cores)
● a larger amount of on-chip memory
Current GPU Example
NVIDIA Tesla cards (with Fermi):
● 448 cores
● 6 GB of memory
● 0.5 teraflops peak double-precision performance
● 148 GB/s bandwidth to the GPU
GPU Architecture
● Most graphics cards have a SIMD architecture
● Graphics cards have a large amount of on-board memory
● GPUs aim for high throughput
● Double precision is available
GPUs are used for highly parallel tasks.
What tasks are GPUs suitable for?
GPUs are suitable for tasks where:
● the task can be broken up into groups of units
● the units in each group execute the same instructions on different data.

But not for tasks that:
● require high levels of communication within the task
● require high levels of flow control, such as if conditions, within the code.
The Physical Problem
An atomic or molecular system in an intense laser field satisfies the TDSE (in atomic units):

i ∂Ψ(t)/∂t = H(t) Ψ(t)
The basis expansion approach
Expanding the wavefunction in a basis, the problem can be recast in the form:

i (d/dt) C(t) = H(t) C(t)

where C(t) is the vector of time-dependent expansion coefficients.
The Hamiltonian structure
Elements of the solution
The solution is of the form:
The Taylor Expansion Method (TE)
C(t + δt) ≈ Σ_{k=0}^{p} (1/k!) (−i δt H)^k C(t)
What is OpenCL?

OpenCL is a C-like language and framework that generalizes the computational resources of a computer.

OpenCL offers:
● portability between all supported architectures
● combined use of CPU and GPU execution
● compilation of code at runtime
● broad hardware vendor support
kernel void MatrixMultiplication(const global double *a,
                                 const global double *b,
                                 global double *c, int n)
{
    int LId = get_local_id(0);           // work-item index within the group
    int GroupId = get_group_id(0);       // work-group index
    int divcol = n / get_local_size(0);  // columns handled per work-item
    int divrow = n / get_num_groups(0);  // rows handled per work-group
    double curr;

    for (int k = 0; k < divrow; k++) {
        int row = GroupId * divrow + k;
        if (row >= n)                    // memory protection
            break;
        for (int j = 0; j < divcol; j++) {
            int col = LId * divcol + j;
            if (col >= n)
                break;
            curr = 0.0;
            for (int i = 0; i < n; i++)
                curr += a[row * n + i] * b[i * n + col];
            c[row * n + col] = curr;
        }
    }
}
Division of Work
Graphics card used
AMD FirePro 7800:
● Cost approx. 750 euro (pre-installed)
● 1 GB of total global memory
● 32 KB per local memory unit
● 64 KB of total constant memory
● 8 KB of private registers per processing element
● 1,440 processing elements
● 64 processing elements per SIMD
● 18 compute units
● 400 gigaflops maximum performance
Existing CPU code in C++
● Thoroughly tested on a number of systems (H, He, Mg, etc.)
● Tested over the last ten years
● Uses a NAG propagator
Results for N = 191
[Chart: execution time (sec) vs angular momentum for OpenCL with work-group sizes 64, 128 and 256, and for the NAG propagator]
Results for N = 391
[Chart: execution time (sec) vs highest angular momentum value for OpenCL and the NAG propagator]
Further Work
Work will be undertaken to port the implementation to NVIDIA-specific CUDA so that it can run at the Irish Centre for High-End Computing (ICHEC).

More sophisticated propagation methods will also be implemented on the GPU.
END
[Timing comparison: OpenCL vs NAG, N = 191, L = 12]
On OpenCL
● Kernels are functions called from regular CPU-based programs (host code).
● Kernels are written in an OpenCL variant of C99.
● Multiple instances of a kernel function are executed by different work items.
● Global synchronization of memory across all work items cannot be done except at the start of a new kernel function call.
Work Items
Each work item executes an instance of a Kernel.
A work item differs from a thread in that:
● its instruction stream should be the same as the rest of the work group
● there is no communication between work items outside of the work group
Queueing in host code
● A problem can be broken up into tasks divided along synchronization points.
● Each part of a task is then implemented in a kernel function.
● In host code, written in host languages such as C, C++ and Fortran, kernels are queued for execution.
● Other items can also be queued, such as copying buffers, or reading/writing buffers into host memory.
Synchronization
● When one item in a queue finishes, the next queued item is guaranteed to execute after it.
● Any changes to memory will be seen by the next item.
● For the Taylor expansion, a synchronization point is required after the calculation of each successive derivative.
Results
[Chart: execution time (sec) for OpenCL with work-group sizes 16, 32, 64, 128 and 256, and for the NAG propagator]
GPU Execution Model