Parallel implementation of the TDSE on a Graphics Processing Unit (GPU) platform
Cathal Ó Broin, Dublin City University
Together with Lampros Nikolopoulos, in collaboration with Ken Taylor
GPU Evolution
Compilers for GPUs are now available in higher-level languages (C and Fortran).
GPUs focus on parallelism.
Compared to CPUs, GPUs have:
● fewer control units
● more processing elements (cores)
● a larger amount of on-chip memory
Current GPU Example
NVIDIA Tesla cards (with Fermi):
● 448 cores
● 6 GB of memory
● 0.5 teraflops peak double-precision performance
● 148 GB/s bandwidth to the GPU
GPU Architecture
● Most graphics cards have a SIMD architecture
● Graphics cards have a large amount of on-board memory
● GPUs aim for high throughput
● Double precision is available
GPUs are used for highly parallel tasks.
What tasks are GPUs suitable for?
GPUs are suitable for tasks where:
● the task can be broken up into groups of units
● the units in each group execute the same instructions on different data.

But not for tasks that:
● require high levels of communication within the task
● require high levels of flow control, such as if conditions, within the code.
The Physical Problem
An atomic or molecular system in an intense laser field satisfies the TDSE (in atomic units):

i ∂Ψ(t)/∂t = H(t) Ψ(t)
The basis expansion approach
Expanding the wavefunction in a basis, the problem can be recast in the form:

i (d/dt) C(t) = H(t) C(t)

where C(t) is the vector of time-dependent expansion coefficients.
The Hamiltonian structure
Elements of the solution
The solution is of the form:
The Taylor Expansion Method (TE)
C(t + δt) ≈ Σ_{k=0}^{p} (1/k!) (−i δt H)^k C(t)
What is OpenCL?

OpenCL is a C-like language and framework that generalizes the computational resources of a computer.

OpenCL offers:
● portability between all supported architectures
● combined use of CPU and GPU execution
● compilation of code at runtime
● broad hardware vendor support
kernel void MatrixMultiplication(const global double *a,
                                 const global double *b,
                                 global double *c, int n)
{
    int LId = get_local_id(0);           // work-item index within the group
    int GroupId = get_group_id(0);       // work-group index
    int divcol = n / get_local_size(0);  // columns handled per work-item
    int divrow = n / get_num_groups(0);  // rows handled per work-group
    double curr;

    for (int k = 0; k < divrow; k++) {
        int row = GroupId * divrow + k;
        if (row >= n)                    // memory protection
            break;
        for (int j = 0; j < divcol; j++) {
            int col = LId * divcol + j;
            if (col >= n)
                break;
            curr = 0.0;
            for (int i = 0; i < n; i++)
                curr += a[row * n + i] * b[i * n + col];
            c[row * n + col] = curr;
        }
    }
}
Division of Work
Graphics card used
AMD FirePro 7800:
● Cost approx. 750 euro (pre-installed)
● 1 GB of total global memory
● 32 KB per local memory unit
● 64 KB of total constant memory
● 8 KB of private registers per processing element
● 1,440 processing elements
● 64 processing elements per SIMD
● 18 compute units
● 400 gigaflops maximum performance
Existing CPU code in C++
● Thoroughly tested on a number of systems (H, He, Mg, etc.)
● Tested over the last ten years
● Uses a NAG propagator
Results for N = 191
[Chart: execution time (sec) vs angular momentum for OpenCL with work-group sizes 64, 128 and 256, and for the NAG propagator]
Results for N = 391
[Chart: execution time (sec) vs highest angular momentum value for OpenCL and the NAG propagator]
Further Work
Work will be undertaken to port the implementation to NVIDIA-specific CUDA so that it can run at the Irish Centre for High-End Computing (ICHEC).

More sophisticated propagation methods will also be implemented on the GPU.
END
[Timing comparison: OpenCL vs NAG, N = 191, L = 12]
On OpenCL
● Kernels are functions called from regular CPU-based programs (host code).
● Kernels are written in an OpenCL variant of C99.
● Multiple instances of a kernel function are executed by different work items.
● Global synchronization of memory across all work items cannot be done except at the start of a new kernel function call.
Work Items
Each work item executes an instance of a Kernel.
A work item differs from a thread in that:
● its instruction stream should be the same as the rest of the work group
● there is no communication between work items outside of the work group
Queueing in host code
● A problem can be broken up into tasks divided along synchronization points.
● Each part of a task is then implemented in a kernel function.
● In host code, written in host languages such as C, C++ and Fortran, kernels are queued for execution.
● Other items can also be queued, such as copying buffers, or reading/writing buffers into host memory.
Synchronization
● When one item in a queue finishes, the next queued item is guaranteed to execute after it.
● Any changes to memory will be seen by the next item.
● For the Taylor expansion, a synchronization point is required after the calculation of each successive derivative.
Results
[Chart: execution time (sec) for OpenCL with work-group sizes 16, 32, 64, 128 and 256, and for the NAG propagator]
GPU Execution Model