OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators

Xavier Martorell
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya

NVIDIA GPU Technology Conference (GTC'13), San José, California, March 18-21, 2013
Motivation
• Variety of accelerators
  – GPUs (NVidia K20)
  – APUs (AMD Fusion)
  – MIC (Intel Xeon Phi)
• Productivity is low, due to the different programming languages
  – Programmers need effective solutions
• OpenACC, OpenMP, OmpSs
Cholesky

• Decomposes an N×N symmetric positive definite matrix A as A = L·Lᵀ
  – L is lower triangular
  – equivalently A = LU, with U = Lᵀ upper triangular
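A tiny worked instance (ours, for illustration): A = [4 2; 2 3] factors with L = [2 0; 1 √2], since L·Lᵀ = [4 2; 2 3].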
void Cholesky(float **A, int nb, int bs)
{
  int i, j, k;
  for (k = 0; k < nb; k++) {
    potrf_tile(A[k*nb + k], bs);
    for (i = k + 1; i < nb; i++)
      trsm_tile(A[k*nb + k], A[k*nb + i], bs);

    for (i = k + 1; i < nb; i++) {
      for (j = k + 1; j < i; j++)
        gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs);
      syrk_tile(A[k*nb + i], A[i*nb + i], bs);
    }
  }
}
[Figure: matrix A stored as NB×NB tiles, each tile holding BS×BS elements]
OpenMP tasks @SMPs

for (k = 0; k < nb; k++) {
  potrf_tile(Ah[k*nb + k], bs);
  for (i = k + 1; i < nb; i++) {
#pragma omp task shared(Ah)
    trsm_tile(Ah[k*nb + k], Ah[k*nb + i], bs);
  }
#pragma omp taskwait
  for (i = k + 1; i < nb; i++) {
    for (j = k + 1; j < i; j++) {
#pragma omp task shared(Ah)
      gemm_tile(Ah[k*nb + i], Ah[k*nb + j], Ah[j*nb + i], bs);
    }
#pragma omp task shared(Ah)
    syrk_tile(Ah[k*nb + i], Ah[i*nb + i], bs);
  }
#pragma omp taskwait
}

• OpenMP tasks need the use of taskwaits, which limits parallelism
OpenMP tasks @SMPs
– OpenMP taskwait
– Imbalance
– Idle time (light blue in trace)
– How can we also exploit heterogeneity?

[Paraver visualization trace]
Outline
• Motivation
• Cholesky @OmpSs
  – Evaluation on SMP and GPUs
• OmpSs @OpenCL&CUDA
  – Characteristics
  – Evaluation
  – Coding applications
• Conclusions & future work
Cholesky @OmpSs
#pragma omp target device(smp) copy_deps
#pragma omp task inout([bs*bs]A)
void potrf_tile(REAL *A, int bs);

#pragma omp target device(cuda) implements(potrf_tile) copy_deps
#pragma omp task inout([bs*bs]A)
void potrf_tile_gpu(REAL *A, int bs);
• OmpSs
  – User functions can also be specified as tasks
  – Data directionality hints
    • Compiler generates information for the runtime
    • Runtime performs required data transfers
  – Provide implementations for different targets
Cholesky @OmpSs

#pragma omp target device(smp) copy_deps
#pragma omp task input([bs*bs]A, [bs*bs]B) inout([bs*bs]C)
void gemm_tile(REAL *A, REAL *B, REAL *C, int bs);

#pragma omp target device(cuda) implements(gemm_tile) copy_deps
#pragma omp task input([bs*bs]A, [bs*bs]B) inout([bs*bs]C)
void gemm_tile_gpu(REAL *A, REAL *B, REAL *C, int bs);

void gemm_tile_gpu(REAL *A, REAL *B, REAL *C, int bs)
{
  unsigned char TR = 'T', NT = 'N';
  REAL DONE = 1.0, DMONE = -1.0;

  // run the CUBLAS kernel on the stream assigned by the Nanos++ runtime
  cudaStream_t stream = nanos_get_kernel_execution_stream();
  cublasSetKernelStream(stream);
  cublasSgemm(NT, TR, bs, bs, bs, DMONE, A, bs, B, bs, DONE, C, bs);
}

✔ Leveraging the use of existing kernels
✔ CUDA, CUBLAS, OpenCL
Cholesky @OmpSs

[Figure: matrix A as NB×NB tiles of BS×BS elements]
void cholesky(REAL **A, int nb, int bs)
{
  for (k = 0; k < nb; k++) {
    // Diagonal block factorization
    potrf_tile(A[k*nb + k], bs);  // spawn

    // Triangular systems
    for (i = k + 1; i < nb; i++) {
      trsm_tile(A[k*nb + k], A[k*nb + i], bs);  // spawn
    }

    // Update trailing matrix
    for (i = k + 1; i < nb; i++) {
      for (j = k + 1; j < i; j++) {
        gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs);  // spawn
      }
      syrk_tile(A[k*nb + i], A[i*nb + i], bs);  // spawn
    }
  }
}
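Unlike the taskwait-based version, no barrier is needed inside the loop: the input/inout annotations on the tiles let Nanos++ overlap iterations. A minimal driver sketch (ours; assumes A, nb and bs are set up as in the figure above):

  cholesky(A, nb, bs);   // spawns all tile tasks
#pragma omp taskwait     // wait for the whole factorization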
Cholesky @SMPs

• Task dependences allow starting iterations earlier

✔ Exploitation of the critical path
✔ OmpSs 14% faster
[Paraver traces: OpenMP tasks vs. OmpSs tasks with dependences]
Cholesky @OpenMP 4.0
void cholesky(REAL **A, int nb, int bs)
{
  for (k = 0; k < nb; k++) {
#pragma omp task depend(inout: A[k*nb + k])
    potrf_tile(A[k*nb + k], bs);

    for (i = k + 1; i < nb; i++) {
#pragma omp task depend(in: A[k*nb + k]) \
                depend(inout: A[k*nb + i])
      trsm_tile(A[k*nb + k], A[k*nb + i], bs);
    }
    for (i = k + 1; i < nb; i++) {
      for (j = k + 1; j < i; j++) {
#pragma omp task depend(in: A[k*nb + i], A[k*nb + j]) \
                depend(inout: A[j*nb + i])
        gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs);
      }
#pragma omp task depend(in: A[k*nb + i]) \
                depend(inout: A[i*nb + i])
      syrk_tile(A[k*nb + i], A[i*nb + i], bs);
    }
  }
}
Execution environment

• Mercurium compiler 2.0 (C/C++/Fortran)
• Nanos++ 1.0
• gcc 4.6
• Nvidia CUDA 4.1
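For reference, a plausible build command, assuming the standard Mercurium driver name mcc and an installation with CUDA support (flags and file names are illustrative only, not from the slides):

  mcc --ompss -O3 cholesky.c gemm_tile_gpu.cu -o cholesky -lcublas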
Execution environment

• Single node on MinoTauro@BSC
  – 2x six-core Intel Xeon E5649 (12 cores)
    • 2.53 GHz, 12 MB L3
  – 2x Nvidia Tesla M2090 GPUs
    • 6 GB global memory each
  – 24 GB RAM
Cholesky @GPUs

[Performance charts: Gflop/s vs. number of cores (1-12), comparing OpenMP tasks, OmpSs tasks, OmpSs+1 GPU, and OmpSs+2 GPUs. Left: 4096 x 4096 matrix, block sizes 256 x 256 (CPU) and 512 x 512 (GPU). Right: 16384 x 16384 matrix, block sizes 256 x 256 (CPU) and 4096 x 4096 (GPU).]
Cholesky @large SMPs

[Performance chart: Gflop/s vs. number of cores (16-48), comparing OpenMP tasks, OmpSs tasks, and peak. 40960 x 40960 matrix, block size 640 x 640, on AMD Opteron 6172 at 2.1 GHz.]

✔ Exploitation of the critical path
✔ OmpSs 65% faster at 48 cores
OmpSs @OpenCL&CUDA
• Thread-pool model
  – SMP threads
  – Device representative thread
• Tasks labeled with "target"
  – smp
  – opencl
  – cuda
  – combinations of the previous
OmpSs @OpenCL applications

Application   Characteristics                  Lines of host code        API calls/directives
                                               (OpenCL / CUDA / OmpSs)   (OpenCL / CUDA / OmpSs)
Matmul        8192x8192 (1024x1024)            292 / 240 / 133           31 / 14 / 3
Julia Set     512x512, 50 frames               943 / 825 / 770           30 / 11 / 5
Krist         1000 atoms, 10000 reflections    446 / 342 / 280           30 / 15 / 3
NBody         MPI+OmpSs, 65536 particles       922 / 800 / 798           26 / 7 / 3

✔ Shorter code compared to OpenCL
✔ Fewer lines of code
✔ Fewer API calls / directives
OmpSs @CUDA applications

✔ Speedup compared to hand-coded CUDA
✔ Competitive performance
[Speedup chart: CUDA vs. OmpSs on 2 and 4 GPUs, for Matmul, Julia, Krist, NBody, and NBody MPI]
OmpSs @OpenCL applications

✔ Speedup compared to hand-coded OpenCL
✔ OmpSs also competitive
[Speedup chart: OpenCL vs. OmpSs on 2 and 4 GPUs, for Matmul, Julia, Krist, NBody, and NBody MPI]
Matmul
• matmul_block function provides the OpenCL kernel
  – Configured as 2-dimensional, on blocks of NBxNB
    • Local work group size BL_SIZE x BL_SIZE
• Data transfers scheduled automatically by Nanos++
#pragma omp target device(opencl) ndrange(2, BS, BS, BL_SIZE, BL_SIZE) copy_deps
#pragma omp task depend(in: A[0:BS*BS], B[0:BS*BS]) depend(inout: C[0:BS*BS])
__kernel void matmul_block(__global REAL *A, __global REAL *B, __global REAL *C, int BS);
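The kernel body itself is not shown on the slide; a naive version consistent with this interface might look as follows (our sketch: each work-item in the BS x BS ndrange computes one element of the block product and accumulates it into C):

__kernel void matmul_block(__global REAL *A, __global REAL *B,
                           __global REAL *C, int BS)
{
  int row = get_global_id(0);
  int col = get_global_id(1);
  if (row < BS && col < BS) {
    REAL acc = 0;
    // dot product of one row of block A with one column of block B
    for (int k = 0; k < BS; k++)
      acc += A[row*BS + k] * B[k*BS + col];
    C[row*BS + col] += acc;
  }
}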
✔ 3 additional directives to the original benchmark!
✔ Including the taskwait at the end
[Figure: DIM x DIM matrices tiled in BS x BS blocks]
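On the host side, a blocked triple loop is enough to drive the kernel (our sketch; it assumes A, B, C are arrays of pointers to BS x BS blocks, as in the Cholesky code):

  for (i = 0; i < DIM; i++)
    for (j = 0; j < DIM; j++)
      for (k = 0; k < DIM; k++)
        matmul_block(A[i*DIM + k], B[k*DIM + j], C[i*DIM + j], BS);  // spawn
#pragma omp taskwait  // the one taskwait mentioned above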
Nbody
• calculate_force_func is the kernel
#pragma omp target device(opencl) ndrange(1, size, MAX_NUM_THREADS) \
                   copy_in(d_particles[0;number_of_particles]) \
                   copy_out([size] out)
#pragma omp task out([size] out) \
                 in(d_particles[0*size;size], \
                    d_particles[1*size;size], \
                    d_particles[2*size;size], \
                    d_particles[3*size;size], \
                    d_particles[4*size;size], \
                    d_particles[5*size;size], \
                    d_particles[6*size;size], \
                    d_particles[7*size;size])
__kernel void calculate_force_func(int size, float time_interval,
                                   int number_of_particles,
                                   __global Particle *d_particles,
                                   __global Particle *out,
                                   int first_local, int last_local);
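A possible host-side spawn loop (our sketch; the chunking and the use of first_local/last_local are assumptions, the slide only shows the kernel interface):

  // one task per chunk of 'size' particles; Nanos++ handles the copies
  for (i = 0; i < number_of_particles / size; i++)
    calculate_force_func(size, time_interval, number_of_particles,
                         d_particles, &out[i*size],
                         i*size, (i+1)*size - 1);
#pragma omp taskwait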
OmpSs Syntax
#pragma omp target device({ smp | cuda | opencl }) [ file(filename.cl) ] \
            ndrange(ndim, global_vals, local_vals) \
            [ implements(function_name) ] \
            { copy_deps | [ copy_in(array_spec, ...) ] [ copy_out(...) ] [ copy_inout(...) ] }

#pragma omp task [ input(...) ] [ output(...) ] [ inout(...) ] \
            [ concurrent(...) ] [ commutative(...) ] \
            [ priority(p) ]
  { code block or function }

#pragma omp taskwait [ on(...) ] [ noflush ]
• Creating and managing tasks & dependences
• Task synchronization
• Extending OpenMP semantics
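A small example combining these clauses (names are ours, purely illustrative):

#pragma omp target device(smp) copy_deps
#pragma omp task input([n]a) inout([n]b) priority(2)
void axpy_task(REAL *a, REAL *b, int n);
...
axpy_task(a, b, n);         // spawned as a task
#pragma omp taskwait on(b)  // wait only for tasks producing b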
Implements feature

• One may have several implementations of
  – the same functionality, on
  – heterogeneous devices
SUBROUTINE VEC_SUM_SMP(N, A, B, RES)
  integer :: i, n
  integer :: a(n), b(n), res(n)

  do i = 1, N
    res(i) = a(i) + b(i)
  end do
END SUBROUTINE
__kernel void vec_sum(int n, __global int *a, __global int *b, __global int *res)
{
  const int idx = get_global_id(0);
  if (idx < n) {
    res[idx] = a[idx] + b[idx];
  }
}
Implements feature

• Annotate both interfaces, the second implementing the first
  – Let the runtime system schedule either one on the available resources

INTERFACE
  !$OMP TARGET DEVICE(SMP)
  !$OMP TASK IN(A, B) OUT(RES)
  SUBROUTINE VEC_SUM_SMP(N, A, B, RES)
  ...

  !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(vec_sum.cl) &
  !$OMP IMPLEMENTS(VEC_SUM_SMP)
  !$OMP TASK IN(A, B) OUT(RES)
  SUBROUTINE VEC_SUM(N, A, B, RES)
  ...
END INTERFACE
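At the call site, the primary symbol is invoked; Nanos++ is free to run either implementation (a minimal sketch):

  ! spawns a task; the runtime picks the SMP or the OpenCL version
  CALL VEC_SUM_SMP(N, A, B, RES)
  !$OMP TASKWAIT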
Conclusions

• OmpSs programming model
  – SMPs, GPUs (CUDA & OpenCL)
  – C/C++/Fortran
• Shown OmpSs ease of use
• Performance comparable to hand-tuned code
  – Still, scalability on GPUs needs improvements
  – Support for hardware features such as constant and texture memory still needs to be added to OmpSs
Future work

• Keep pushing for these extensions to be included in the OpenMP standard
• Include support for GPU memory types
  – Constant, texture
• Building OmpSs @FPGAs
  – Collaboration with Xilinx Dublin Research Lab
• Interoperate with graphics rendering in OpenGL
• Work with Mont-Blanc and DEEP applications
Acknowledgments
• Parallel programming group @BSC
• Encore/Mont-Blanc/DEEP projects
  – European Commission
• Spanish Ministry of Education
• Catalan Government
OmpSs available at Barcelona Supercomputing Center http://pm.bsc.es