David Bailey, University of Manchester
Need to simulate the passage of beams of particles through an accelerator
Beams typically contain 10^11 particles at any one time
Also need to figure out where particles are lost
Too many lost in one place and you end up with a radioactive accelerator which is bad!
Calculations involve tracking of many particles’ trajectories through the accelerating, bending and focusing elements of an accelerator to understand the spatial extent of the beam and the stability of motion
A typical accelerator is composed of RF cavities (accelerating elements) and magnetic elements which bend and focus the beam
Particle transport through an accelerator is based on maps acting on phase space vectors
Each particle is described by 6 “co-ordinates”
Maps are matrices acting on these vectors
\mathbf{x} = \begin{pmatrix} x \\ x_p \\ y \\ y_p \\ t \\ t_p \end{pmatrix}
These are canonical variables in the Hamiltonian sense; (t, t_p) make up the longitudinal phase space
An accelerator element acts on the phase space vector:

z_j = \Delta x_j + \sum_{k=1}^{6} R_{jk} x_k + \sum_{k=1}^{6} \sum_{l=1}^{6} T_{jkl} x_k x_l

Here, R represents the linear part of the transport map and T the second order corrections.
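A minimal sketch of applying such a transport map to a single 6D phase-space vector (this assumes nothing about the GPMAD internals; the type names are purely illustrative):

```cpp
#include <array>
#include <cstddef>

// z_j = dx_j + sum_k R_jk x_k + sum_{k,l} T_jkl x_k x_l
using Vec6 = std::array<double, 6>;
using Mat6 = std::array<Vec6, 6>;   // R: linear transport map
using Ten6 = std::array<Mat6, 6>;   // T: second-order corrections

Vec6 apply_map(const Vec6& dx, const Mat6& R, const Ten6& T, const Vec6& x) {
    Vec6 z{};
    for (std::size_t j = 0; j < 6; ++j) {
        z[j] = dx[j];
        for (std::size_t k = 0; k < 6; ++k) {
            z[j] += R[j][k] * x[k];               // linear part
            for (std::size_t l = 0; l < 6; ++l)
                z[j] += T[j][k][l] * x[k] * x[l]; // second-order part
        }
    }
    return z;
}
```

With the identity for R and zero dx and T, the map leaves the vector unchanged, which is a convenient sanity check.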
R_{\text{drift}} = \begin{pmatrix}
1 & L & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & L & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & L/(\beta^2\gamma^2) \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}

Typical matrices are sparse
This represents transport through a drift of length L (i.e. no magnets or accelerating structures)
Reference is to old Fortran code: Methodical Accelerator Design (MAD)
Developed by CERN, dates back to 1984
Still used today
Problem of dealing with large numbers of particles in a reasonable time
In practice, run many smaller jobs and merge the results
Start with something simple...
Available from the Stanford Graphics Laboratory, though showing its age now...
Works with the majority of programmable GPUs
Uses the ‘Stream’ concept to define data flow between host and device
Streams implemented as an extension to regular C
Results obtained using an nVidia 7900 GT
Brook *.br file compiled using the Brook Runtime C Compiler (BRCC)
Works with both DirectX and OpenGL back-ends
BRCC produces C code with extensive headers
Without expert knowledge, stream sizes are non-dynamic; we stuck to fixed sizes
Available from nVidia
Works only with GPUs from the 8400 upwards
Future compatibility with nVidia’s TESLA computing solutions
Future native double-precision support
Easy to set up
GPMAD is mostly C++
GPU processing for the computationally expensive matrix-vector multiplication plus second order perturbations
Second-order terms preclude the use of the integrated CUBLAS matrix-vector library
No complicated optimisations done yet: we still process each element sequentially and copy results back to host memory each time
Scope for further optimisations
Results obtained from a laptop (GeForce 8600M); a GeForce 8800 Ultra board still to try...
#include <cppIntegration_kernel.cu>
Forward external declaration, required for interface with the rest of the code in C++
extern "C" void theCoreGPU(int number_of_particles, float *x,...
Initialise the GPU for computational purposes
CUT_DEVICE_INIT();
Calculate and define parameters required for GPU memory allocation and copying
unsigned int num_threads = 256;
unsigned int num_blocks = number_of_particles/num_threads;
unsigned int mem_size = sizeof( float) * number_of_particles;
Allocate device memory and associated pointers
float* d_i_x;
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_i_x, mem_size));
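One caveat with the block count above: the integer division silently drops any remainder, so it only covers every particle when the count is an exact multiple of 256. A common fix (not in the code shown here) is ceiling division plus a bounds check inside the kernel:

```cpp
// Round the grid size up so particle counts that are not an exact
// multiple of the block size are still covered. The kernel must then
// guard against the excess threads, e.g. `if (tid >= n) return;`.
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block; // ceiling division
}
```

For example, 1000 particles with 256 threads per block need 4 blocks, not the 3 that plain integer division would give.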
Copy from host (main) memory to device (GPU) memory
CUDA_SAFE_CALL( cudaMemcpy(d_i_x, x, mem_size, cudaMemcpyHostToDevice));
Define execution parameters – how the processes map onto the multiprocessors
dim3 grid( num_blocks, 1, 1);
dim3 threads( num_threads, 1, 1);
Execute the kernel
testKernel<<< grid, threads >>>( d_i_x, d_i_px...
Copy result back to main memory
CUDA_SAFE_CALL( cudaMemcpy(x, d_i_x, mem_size, cudaMemcpyDeviceToHost));
De-allocate GPU memory
CUDA_SAFE_CALL( cudaFree(d_i_x));
Declare the function as a kernel with ‘__global__’ (callable from the host, executed on the device)
__global__ void testKernel( float* x, float* px...
Access the thread identification based on the setup parameters
const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
Load this thread’s particle from device memory into a register
float XDATA = x[tid];
Allocate a variable to store the result
float _XDATA;
Perform the computation
_XDATA = R11*XDATA + R12*PXDATA + R16*DPDATA + T111*XDATA*XDATA;
Write the result back to device memory (copied to the host afterwards)
x[tid] = _XDATA;
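For validating a kernel like this, a plain C++ host reference of the same per-particle update is handy. The coefficient and array names follow the slide; the function name and any values used with it are purely illustrative:

```cpp
#include <cstddef>
#include <vector>

// Host-side reference of the kernel's per-particle update:
// x_new = R11*x + R12*px + R16*dp + T111*x*x
void track_x(std::vector<float>& x, const std::vector<float>& px,
             const std::vector<float>& dp,
             float R11, float R12, float R16, float T111) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = R11 * x[i] + R12 * px[i] + R16 * dp[i] + T111 * x[i] * x[i];
}
```

Running both on the same input and comparing element-by-element (within a single-precision tolerance) catches indexing and launch-configuration bugs quickly.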
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): x position [m] against distance along beamline [m], MAD (reference) tracking compared with GPMAD tracking]
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): px value against distance along beamline [m], MAD (reference) tracking compared with GPMAD tracking]
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): number of particles against ‘x’ distribution, GPMAD compared with MAD]
[Figure: CHEP ‘07 (using Brook) vs MRSC 2008 (using CUDA): time taken [s] against number of particles (up to 4,000,000), GPMAD time compared with MAD time]

Overall performance gain: a factor of 4
[Figure: number of particles lost per 10,000 input particles at each element along the beamline, from BFQUD at 0.17 m to SRSEPTUM at 68.04 m]

Total particle loss = 3.99 × 10^9 electrons per second
GPUs show promise
Speed improvement by a factor of 4 without doing anything really clever
Accuracy looks sufficient
Interested to see what double precision will bring
Only demonstrated a simple simulation so far
Need to move on to a complete ring and many turns
Precision may become an issue here
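To see why precision matters over many turns, here is an illustrative sketch (not GPMAD code): iterate a linear one-turn map, a rotation by the betatron phase advance mu, and watch the invariant x^2 + p^2 drift away from 1 in single versus double precision.

```cpp
#include <cmath>

// Apply a one-turn rotation map `turns` times in precision T and return
// the invariant x^2 + p^2, which equals 1 exactly in exact arithmetic.
// Rounding of cos/sin and of each multiply-add makes it drift.
template <typename T>
T amplitude_after(int turns, T mu) {
    T c = std::cos(mu), s = std::sin(mu);
    T x = 1, p = 0;
    for (int i = 0; i < turns; ++i) {
        T xn = c * x + s * p;
        p = -s * x + c * p;
        x = xn;
    }
    return x * x + p * p;
}
```

Over a million turns the single-precision drift is many orders of magnitude larger than the double-precision one, which is exactly the concern for full-ring, many-turn tracking.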
Would be great to have a common interface to different hardware
Not in a position to dictate to the community what they should buy