David Bailey, University of Manchester
Need to simulate the passage of beams of particles through an accelerator
Beams typically contain 10^11 particles at any one time
Also need to figure out where particles are lost
Too many lost in one place and you end up with a radioactive accelerator which is bad!
Calculations involve tracking of many particles’ trajectories through the accelerating, bending and focusing elements of an accelerator to understand the spatial extent of the beam and the stability of motion
A typical accelerator is composed of RF cavities (accelerating elements) and magnetic elements which bend and focus the beam
Particle transport through an accelerator is based on maps acting on phase space vectors
Each particle is described by 6 “co-ordinates”
Maps are matrices acting on these vectors
\mathbf{x} = \begin{pmatrix} x \\ x_p \\ y \\ y_p \\ t \\ t_p \end{pmatrix}
These are canonical variables in the Hamiltonian sense; (t, t_p) make up the longitudinal phase space
An accelerator element acts on the phase space vector:

z_j = \Delta x_j + \sum_{k=1}^{6} R_{jk} x_k + \sum_{k=1}^{6} \sum_{l=1}^{6} T_{jkl} x_k x_l

Here, R represents the linear part of the transport map and T the second order corrections.
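A minimal sketch of applying such a transport map to a single 6D phase-space vector (this assumes nothing about the GPMAD internals; the type names are purely illustrative):

```cpp
#include <array>
#include <cstddef>

// z_j = dx_j + sum_k R_jk x_k + sum_{k,l} T_jkl x_k x_l
using Vec6 = std::array<double, 6>;
using Mat6 = std::array<Vec6, 6>;   // R: linear transport map
using Ten6 = std::array<Mat6, 6>;   // T: second-order corrections

Vec6 apply_map(const Vec6& dx, const Mat6& R, const Ten6& T, const Vec6& x) {
    Vec6 z{};
    for (std::size_t j = 0; j < 6; ++j) {
        z[j] = dx[j];
        for (std::size_t k = 0; k < 6; ++k) {
            z[j] += R[j][k] * x[k];               // linear part
            for (std::size_t l = 0; l < 6; ++l)
                z[j] += T[j][k][l] * x[k] * x[l]; // second-order part
        }
    }
    return z;
}
```

With the identity for R and zero dx and T, the map leaves the vector unchanged, which is a convenient sanity check.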
R_{\text{drift}} = \begin{pmatrix}
1 & L & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & L & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & L/(\beta^2\gamma^2) \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}

Typical matrices are sparse
This represents transport through a drift of length L (i.e. no magnets or accelerating structures)
Reference is to old Fortran code: Methodical Accelerator Design (MAD)
Developed by CERN, dates back to 1984
Still used today
Problem of dealing with large numbers of particles in a reasonable time
In practice, run many smaller jobs and merge the results
Start with something simple...
Available from the Stanford Graphics Laboratory, though showing its age now...
Works with the majority of programmable GPUs
Uses the ‘Stream’ concept to define data flow between host and device
Streams implemented as an extension to regular C
Results obtained using an nVidia 7900 GT
Brook *.br file compiled using the Brook Runtime C Compiler (BRCC)
Works with both DirectX and OpenGL back-ends
BRCC produces C code with extensive headers
Without expert knowledge, stream sizes are non-dynamic; we stuck to fixed sizes
Available from nVidia
Works only with GPUs from the 8400 upwards
Future compatibility with nVidia’s TESLA computing solutions
Future native double-precision support
Easy to set up
GPMAD is mostly C++
GPU processing for the computationally expensive matrix-vector multiplication plus second order perturbations
Second-order terms preclude the use of the integrated CUBLAS matrix-vector library
No complicated optimisations done yet: we still process each element sequentially and copy results back to host memory each time
Scope for further optimisations
Results obtained from a laptop (GeForce 8600M); a GeForce 8800 Ultra board still to try...
#include <cppIntegration_kernel.cu>
Forward external declaration, required for interface with the rest of the code in C++
extern "C" void theCoreGPU(int number_of_particles, float *x,...
Initialise the GPU for computational purposes
CUT_DEVICE_INIT();
Calculate and define parameters required for GPU memory allocation and copying
unsigned int num_threads = 256;
unsigned int num_blocks = number_of_particles/num_threads;
unsigned int mem_size = sizeof( float) * number_of_particles;
Allocate device memory and associated pointers
float* d_i_x;
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_i_x, mem_size));
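One caveat with the block count above: the integer division silently drops any remainder, so it only covers every particle when the count is an exact multiple of 256. A common fix (not in the code shown here) is ceiling division plus a bounds check inside the kernel:

```cpp
// Round the grid size up so particle counts that are not an exact
// multiple of the block size are still covered. The kernel must then
// guard against the excess threads, e.g. `if (tid >= n) return;`.
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block; // ceiling division
}
```

For example, 1000 particles with 256 threads per block need 4 blocks, not the 3 that plain integer division would give.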
Copy from host (main) memory to device (GPU) memory
CUDA_SAFE_CALL( cudaMemcpy(d_i_x, x, mem_size, cudaMemcpyHostToDevice));
Define execution parameters – how the processes map onto the multiprocessors
dim3 grid( num_blocks, 1, 1);
dim3 threads( num_threads, 1, 1);
Execute the kernel
testKernel<<< grid, threads >>>( d_i_x, d_i_px...
Copy result back to main memory
CUDA_SAFE_CALL( cudaMemcpy(x, d_i_x, mem_size, cudaMemcpyDeviceToHost));
De-allocate GPU memory
CUDA_SAFE_CALL( cudaFree(d_i_x));
Declare the function as a kernel with ‘__global__’ (callable from the host, executed on the device)
__global__ void testKernel( float* x, float* px...
Access the thread identification based on the setup parameters
const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
Load this thread’s particle from device memory into a register
float XDATA = x[tid];
Allocate a variable to store the result
float _XDATA;
Perform the computation
_XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA + T111*XDATA*XDATA;
Write the result back to device memory (copied to the host afterwards)
x[tid] = _XDATA;
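For validating a kernel like this, a plain C++ host reference of the same per-particle update is handy. The coefficient and array names follow the slide; the function name and any values used with it are purely illustrative:

```cpp
#include <cstddef>
#include <vector>

// Host-side reference of the kernel's per-particle update:
// x_new = R11*x + R12*px + R16*dp + T111*x*x
void track_x(std::vector<float>& x, const std::vector<float>& px,
             const std::vector<float>& dp,
             float R11, float R12, float R16, float T111) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = R11 * x[i] + R12 * px[i] + R16 * dp[i] + T111 * x[i] * x[i];
}
```

Running both on the same input and comparing element-by-element (within a single-precision tolerance) catches indexing and launch-configuration bugs quickly.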
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): x position [m] against distance along beamline [m], MAD (reference) tracking compared with GPMAD tracking]
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): px value against distance along beamline [m], MAD (reference) tracking compared with GPMAD tracking]
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): number of particles against ‘x’ distribution, GPMAD compared with MAD]
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): time taken [s] against number of particles (up to 4,000,000), GPMAD time compared with MAD time]

Overall performance gain: a factor of 4
[Figure: number of particles lost per 10,000 input particles at each element along the beamline, from BFQUD at 0.17 m to SRSEPTUM at 68.04 m]

Total particle loss = 3.99 × 10^9 electrons per second
GPUs show promise
Speed improvement by a factor of 4 without doing anything really clever
Accuracy looks sufficient
Interested to see what double precision will bring
Only demonstrated a simple simulation so far
Need to move on to a complete ring and many turns
Precision may become an issue here
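To see why precision matters over many turns, here is an illustrative sketch (not GPMAD code): iterate a linear one-turn map, a rotation by the betatron phase advance mu, and watch the invariant x^2 + p^2 drift away from 1 in single versus double precision.

```cpp
#include <cmath>

// Apply a one-turn rotation map `turns` times in precision T and return
// the invariant x^2 + p^2, which equals 1 exactly in exact arithmetic.
// Rounding of cos/sin and of each multiply-add makes it drift.
template <typename T>
T amplitude_after(int turns, T mu) {
    T c = std::cos(mu), s = std::sin(mu);
    T x = 1, p = 0;
    for (int i = 0; i < turns; ++i) {
        T xn = c * x + s * p;
        p = -s * x + c * p;
        x = xn;
    }
    return x * x + p * p;
}
```

Over a million turns the single-precision drift is many orders of magnitude larger than the double-precision one, which is exactly the concern for full-ring, many-turn tracking.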
Would be great to have a common interface to different hardware
Not in a position to dictate to the community what they should buy