
Page 1: Slide tesi

NVIDIA Research

Parallelization of the Algorithm WHAM with NVIDIA CUDA

Presented by Nicolò Savioli

Academic year 2012/2013

Alma Mater Studiorum - University of Bologna, Master's Degree in Biomedical Engineering

Supervisor: Prof. Stefano Severi

Co-Supervisor: Ing. Simone Furini

Page 2: Slide tesi


Free-Energy:

The aim of this thesis is to port the WHAM algorithm, originally implemented for the CPU, to GPU graphics cards. WHAM is an algorithm for estimating free-energy profiles from Molecular Dynamics simulations.

Free energy estimates can be used to identify the affinity between molecules (Pharmacological Research).

The difference in free energy between two configurations, 0 and 1, can be expressed as:

$$\Delta A = A_1 - A_0 = -k_B T \log\left(P_1 / P_0\right)$$

In Molecular Dynamics the atomic trajectories follow Newton's equations of motion:

$$F_i = -\nabla_i V(r_1, \ldots, r_N), \qquad F_i = m_i a_i, \quad i = 1, \ldots, N$$
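As a small worked example (not from the original slides; the probabilities P0, P1 and the temperature T = 300 K are assumed values), the free-energy difference can be computed directly from two sampled probabilities:

#include <cmath>
#include <cstdio>

int main() {
    // Assumed example values: occupation probabilities of two configurations
    const double P0 = 0.70, P1 = 0.30;
    const double kB = 0.0083144621; // Boltzmann constant in kJ/(mol K)
    const double T  = 300.0;        // temperature in K
    // Delta A = A1 - A0 = -kB * T * log(P1 / P0)
    const double dA = -kB * T * std::log(P1 / P0);
    std::printf("Delta A = %.3f kJ/mol\n", dA); // positive: configuration 1 is less probable
    return 0;
}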

Page 3: Slide tesi


Umbrella Sampling (Torrie and Valleau, 1977)

Reaction coordinate: $\xi(r^{3N})$

The problem is that Molecular Dynamics trajectories are limited in time (the system gets trapped in local energy minima).

A biasing potential can be used to force the system to explore new configurations.

In Umbrella Sampling several simulations with different biasing potentials are run to explore the configuration space.

$$H_i(\Gamma) = H_0 + W_i(\xi)$$

Biased Hamiltonian = Unbiased Hamiltonian + Biasing Potential

$$W_i(\xi) = \frac{k}{2}\left(\xi - \xi_i^0\right)^2$$
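A minimal sketch of the harmonic bias evaluation; the force constant k and the window center xi0_i are user-chosen parameters, not values taken from the thesis:

// Harmonic umbrella bias W_i(xi) = (k/2) * (xi - xi0_i)^2
// applied to window i; k and xi0_i are assumed, user-chosen parameters.
double bias(double xi, double xi0_i, double k) {
    const double d = xi - xi0_i;
    return 0.5 * k * d * d;
}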

[Figure: an ion inside an ion channel]

Page 4: Slide tesi


Weighted Histogram Analysis Method (WHAM)

Our aim is to calculate the properties of the original (unbiased) system using the trajectories of the biased simulations.

In the WHAM algorithm the probability of the unbiased system is calculated as a linear combination of R estimates obtained from R independent trajectories.

Minimization of the variance of the unbiased probability gives the following set of equations:

$$P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / 2\tau_i(\xi_h)}{\sum_{j=1}^{R} \left(n_j / 2\tau_j(\xi_h)\right) e^{-\beta\left(W_j(\xi_h) - f_j\right)}} \, P_i^b(\xi_h)$$

where $n_i$ is the number of samples inside bin $h$ and $\tau_i(\xi_h)$ is the integrated autocorrelation time.

$$f_i = -\frac{1}{\beta} \log\left(\sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)}\right)$$

a) Start with an arbitrary set of $f_i$.

b) Use the first equation to calculate $P^u(\xi_h)$.

c) Use the second equation to update the $f_i$; steps b) and c) are repeated until convergence (see the sketch below).
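To make the iteration concrete, here is a minimal serial C++ sketch of the self-consistent loop. It is not the thesis code: the autocorrelation weights are omitted (tau = 1) and all names are illustrative.

#include <algorithm>
#include <cmath>
#include <vector>

// Pb[i][h]: biased histogram of window i, n[i]: samples in window i,
// W[i][h]: bias energy of window i in bin h, beta: 1/(kB*T).
void wham(const std::vector<std::vector<double>>& Pb,
          const std::vector<double>& n,
          const std::vector<std::vector<double>>& W,
          double beta, int numit, double tol,
          std::vector<double>& Pu, std::vector<double>& f) {
    const int R = (int)Pb.size();    // number of windows
    const int H = (int)Pb[0].size(); // number of bins
    Pu.assign(H, 0.0);
    f.assign(R, 0.0);                // a) arbitrary initial f_i
    for (int it = 0; it < numit; ++it) {
        // b) unbiased probability from the first WHAM equation
        for (int h = 0; h < H; ++h) {
            double num = 0.0, den = 0.0;
            for (int i = 0; i < R; ++i) num += n[i] * Pb[i][h];
            for (int j = 0; j < R; ++j) den += n[j] * std::exp(-beta * (W[j][h] - f[j]));
            Pu[h] = num / den;
        }
        // c) update the f_i from the second WHAM equation
        double shift = 0.0;
        for (int i = 0; i < R; ++i) {
            double s = 0.0;
            for (int h = 0; h < H; ++h) s += Pu[h] * std::exp(-beta * W[i][h]);
            const double fnew = -std::log(s) / beta;
            shift = std::max(shift, std::fabs(fnew - f[i]));
            f[i] = fnew;
        }
        if (shift < tol) break; // converged
    }
}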

Page 5: Slide tesi


Why GPU?

In recent years new parallel architectures have improved computational capabilities, making numerical simulations faster and more efficient.

One of the strategies used to parallelize mathematical models is GPGPU (General Purpose Computing on Graphics Processing Units).

GPGPU was originally developed for image processing and is now also used in scientific simulations.

In recent years the computational capability of these architectures has grown exponentially compared with CPUs, and since 2007 NVIDIA has offered a dedicated language, CUDA, for programming its GPUs.

Page 6: Slide tesi


The programming model of NVIDIA GPUs is SIMD (Single Instruction, Multiple Data): a single control unit executes one instruction at a time, driving multiple ALUs that work synchronously.

GPU Architecture:

The GPU is connected to the host through a PCI-Express bus.

A GPU consists of a number of Streaming Multiprocessors (SMs), each with registers, execution pipelines and caches.

Shared Memory (32 KB): small, but fast!

Global Memory: from 256 MB to 6 GB, with a bandwidth of about 150 GB/s.

Each SM contains 8 or 16 Streaming Processors (SPs): floating-point and integer logic units.

Texture Memory: designed for mapping 2D textures onto polygonal models.
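As an aside (not on the original slide), these per-device limits can be queried at runtime through the standard CUDA API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    std::printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("Shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
    std::printf("Global memory:             %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
    std::printf("Compute capability:        %d.%d\n", prop.major, prop.minor);
    return 0;
}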

Page 7: Slide tesi


Example Code

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    // i) global index that maps each thread to one element
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // a) Allocate input vectors h_A and h_B (and output h_C) in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // b) Allocate vectors in device memory
    float* d_A; cudaMalloc(&d_A, size);
    float* d_B; cudaMalloc(&d_B, size);
    float* d_C; cudaMalloc(&d_C, size);

    // c) Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // d) Threads are grouped into blocks, which in turn form a grid:
    //    choose the number of threads per block and blocks per grid
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // e) Invoke kernel
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // f) Copy result from device memory to host memory (h_C holds the result)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // g) Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

    // h) Free host memory
    ...
}

Page 8: Slide tesi


CUDA WHAM Considerations:

The code consists of 11 files, invoked as external functions, plus a main file that initializes the variables and executes the iterative algorithm.

The C++ function clock() was used to time the code. The following optimizations were made:

Constant Memory was used to store the most frequently accessed variables.

To optimize the summations we used a CUDA technique called sum reduction: the threads of a block are synchronized with __syncthreads(), and each thread produces a partial result that is shared with the others through Shared Memory (see the sketch below).
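A minimal sketch of such a shared-memory sum reduction; this is the standard CUDA pattern, not the thesis kernel itself, and the fixed block size of 256 is an assumption:

// Each block reduces 256 elements of 'in' to one partial sum in 'out[blockIdx.x]';
// launch with blockDim.x == 256, then reduce the partial sums (or copy them back).
__global__ void sumReduce(const float* in, float* out, int n)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f; // each thread loads one element
    __syncthreads();                     // wait until the block's data is in Shared Memory
    // Tree reduction in Shared Memory: halve the active threads at each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0]; // one partial sum per block
}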

Page 9: Slide tesi


Organization of the code:

// invoke the external CUDA function that calculates the bias
Bias(HIST.numhist, HIST.numwin, HIST.numdim, dev_numhist, dev_numdim, dev_histmin,
     dev_center, dev_harmrest, dev_delta, dev_step, dev_numbin, dev_U, dev_numwham);

while ((it < numit) && (!converged)) {
    // calculate P (new probability)
    NewProbabilities(cpu_numhist[0], cpu_numwin[0], dev_numhist, dev_numwin,
                     dev_numbinwin, dev_g, dev_numwham, dev_U, dev_F, dev_denwham,
                     dev_Punnorm_result);
    // calculate the new sums
    summationP(cpu_numhist[0], cpu_numwin[0], dev_numhist, dev_numwin, dev_U,
               dev_UU, dev_numwham);
    NewSum(dev_numhist, cpu_numwin[0], dev_sumP, dev_UU, dev_Punnorm_result,
           dev_numwham);
    // calculate the new constants F
    NewConstants(cpu_numhist[0], cpu_numwin[0], dev_U, dev_Punnorm_result,
                 dev_sumP, dev_F, dev_numwham);
    // calculate the normalization constant
    NormFactor(cpu_numhist[0], dev_Punnorm_result,
               dev_sum_normfactor_for_normprob_and_normcoef, dev_numwham);
    // normalization of P
    NormProbabilities(cpu_numhist[0], dev_sum_normfactor_for_normprob_and_normcoef,
                      dev_Punnorm_result, dev_P, dev_numwham);
    // normalization of F
    NormCoefficient(cpu_numwin[0], dev_sum_normfactor_for_normprob_and_normcoef,
                    dev_F, dev_sumP);
    // convergence check of the mathematical model
    CheckConvergence(cpu_numhist[0], dev_P, dev_P_old, HIST.numgood,
                     dev_rmsd_result, dev_numwham);
    // calculate the free energy
    ComputeEnergy(cpu_numhist[0], dev_P, dev_kT, dev_A_result, dev_P_old, dev_denwham);
    cudaMemcpy(cpu_rmsd_result, dev_rmsd_result, sizeof(float),
               cudaMemcpyDeviceToHost);
    if (cpu_rmsd_result[0] < tol)
        converged = true; // has it converged?
    it++;
}

$$P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / 2\tau_i(\xi_h)}{\sum_{j=1}^{R} \left(n_j / 2\tau_j(\xi_h)\right) e^{-\beta\left(W_j(\xi_h) - f_j\right)}} \, P_i^b(\xi_h)$$

$$f_i = -\frac{1}{\beta} \log\left(\sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)}\right)$$

$$W_i(\xi) = \frac{k}{2}\left(\xi - \xi_i^0\right)^2 \qquad A(\xi) = -k_B T \log\left(P(\xi)\right)$$

$$NF = \sum_h P^u(\xi_h) \qquad P^u(\xi_h) \leftarrow P^u(\xi_h) / NF \qquad f_i \leftarrow f_i + \log(NF)$$

$$Conv = \left(P^u_{new}[i] - P^u_{old}[i]\right)^2$$
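As an illustration (hypothetical, not the thesis kernel), the per-bin convergence term can be computed element-wise and the values then summed with a reduction like the one sketched earlier:

// Conv[i] = (P_new[i] - P_old[i])^2, one thread per bin.
__global__ void SquaredDiff(const float* P_new, const float* P_old, float* Conv, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float d = P_new[i] - P_old[i];
        Conv[i] = d * d;
    }
}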

Page 10: Slide tesi


Architectures used:

GPU WHAM was tested on different GPU architectures and compared with the corresponding CPU WHAM.

GT 9500 with Compute Capability of 1.1 (32 CUDA cores)

GT 320M with Compute Capability of 1.0 (24 CUDA cores)

Athlon X2 64 Dual Core

Intel i5 3400 Quad Core

Page 11: Slide tesi


Analysis of Convergence

GT 9500 (32 CUDA Cores) and GT 320M (24 CUDA Cores): they reach the same point of convergence!

[Figure: free energy (kJ/mol) vs. time (s) for both cards]

Page 12: Slide tesi


MORE POWER!!!

Performance:

Performance almost doubles from compute capability 1.0 to compute capability 1.1.

[Figure: execution time (s) vs. number of iterations for the GT 320M (24 CUDA Cores) and the GT 9500 (32 CUDA Cores)]

Page 13: Slide tesi


Ratio with variable grid:

The GPU/CPU time ratio is constant as the grid size increases: there are no memory-traffic problems!

[Figure: GPU/CPU time (s) vs. number of grid dimensions]

Page 14: Slide tesi


Conclusions:

For the first time, the WHAM algorithm has been implemented on a GPU.

The execution speed of the GPU-WHAM algorithm increases with the computational power of the graphics card used.

The GPU/CPU speed ratio is constant when changing the size of the grid.

GPU-WHAM can run in parallel with CPU calculations, further increasing the overall speed of execution.

Page 15: Slide tesi


Thank you for your attention!