Programming with OmpSs@CUDA/OpenCL

Computer Sciences Research Dept.
• Contact: [email protected]
• Source code available from http://pm.bsc.es/ompss/
• Data copies to/from device memory
• Manual work scheduling
• Host memory: h
• Device memory: devh
• Different data sizes due to blocking make the code confusing
h = (float*) malloc(sizeof(*h)*DIM2_H*nr);
r = cudaMalloc((void**)&devh,sizeof(*h)*nr*DIM2_H);
• Increased options for data overwrite compared to homogeneous programming
cudaMemcpy(devh,h,sizeof(*h)*nr*DIM2_H, cudaMemcpyHostToDevice);
Main.c
    // Initialize device, context, and buffers
    ...
    memobjs[1] = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float4) * n, srcB, NULL);
    // create the kernel
    kernel = clCreateKernel(program, "dot_product", NULL);
    // set the args values
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);
    // set work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 1;
    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                                 local_work_size, 0, NULL, NULL);
    // read results
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n*sizeof(cl_float), dst, 0, NULL, NULL);
    ...
__kernel void dot_product(__global const float4 * a,
                          __global const float4 * b,
                          __global float * c)   // one float result per work-item
{
    int gid = get_global_id(0);
    c[gid] = dot(a[gid], b[gid]);
}
• Detection of dependencies at runtime
• Automatic data movement
• Thread-pool model
  – OpenMP parallel is "ignored"
• All threads are created on startup
  – One of them (SMP) executes main... and tasks
  – P-1 workers (SMP) execute tasks
  – One representative (SMP to OpenCL/CUDA) per device
  – Experimenting with a single representative for N devices
• All threads get work from a task pool
  – Work is labeled with possible "targets"
  – Tasks with several targets are scheduled to different devices at the same time
OmpSs: memory model
• A single global address space
• The runtime system takes care of the devices/local memories
• SMP machines: no need for extra runtime support
• Distributed/heterogeneous environments
• Versions of the same data can reside on them
• Data consistency ensured by the runtime system
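The idea that several versions of the same data can coexist, with the runtime keeping them consistent, can be pictured as a small directory that records which address spaces currently hold a valid copy. The following is a minimal sketch of that bookkeeping, not the Nanos++ implementation: `region_state`, `region_read`, and `region_write` are hypothetical names, and only two address spaces (the host and one device) are assumed.

```c
#include <stdbool.h>

/* Hypothetical address-space identifiers for this sketch. */
enum { HOST = 0, DEVICE = 1, NUM_SPACES = 2 };

/* Per-region directory entry: where a valid copy lives, plus a transfer counter. */
typedef struct {
    bool valid[NUM_SPACES];
    int transfers;
} region_state;

/* Data starts out valid only in host memory. */
void region_init(region_state *r)
{
    r->valid[HOST] = true;
    r->valid[DEVICE] = false;
    r->transfers = 0;
}

/* A task running in `space` reads the region: copy it in only if no valid copy is there. */
void region_read(region_state *r, int space)
{
    if (!r->valid[space]) {
        r->transfers++;
        r->valid[space] = true;
    }
}

/* A task running in `space` overwrites the region: every other copy becomes stale. */
void region_write(region_state *r, int space)
{
    for (int s = 0; s < NUM_SPACES; s++)
        r->valid[s] = (s == space);
}
```

In this model, two consecutive device tasks reading the same region cost one transfer, and a host read after a device write costs one more.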
OmpSs: Task Syntax
{code block or function}
Master waits for sons or specific data availability
#pragma omp taskwait [ on (...) ]
Defaults: data is shared
Target directive
• Task implementation for a device
• The compiler configures the kernel with ndrange
• Support for multiple implementations of a task
• Ask the runtime to ensure consistent data is accessible in the address space of the device

#pragma omp target device ({ smp | opencl | cuda }) \
    [ implements ( function_name ) ] \
    [ copy_deps | no_copy_deps ] [ copy_in ( array_spec ,... ) ] [ copy_out (...) ] [ copy_inout (...) ] \
    [ ndrange ( dim, … ) ] [ shmem (...) ] [ file ( name ) ] [ name ( name ) ]
#pragma omp task [ in (...) ] [ out (...) ] [ inout (...) ] [ concurrent (...) ] [ commutative (...) ] \
    [ priority (P) ] [ label (name) ] \
    [ shared (...) ] [ private (...) ] [ firstprivate (...) ] [ default (...) ] [ untied ] [ final ] [ if (expression) ]

Defaults: data is shared; consistency is copy_deps
• Possibility to wait for tasks, but not necessarily data transfers
• Possibility to wait for specific data, generated from one or several tasks
The same task syntax with OpenMP-style depend clauses:

#pragma omp task [ depend(in: array_spec...) ] [ depend(out: ...) ] [ depend(inout: ...) ] \
    [ depend(concurrent: …) ] [ depend(commutative: ...) ] \
    [ priority (P) ] [ label (name) ] \
    [ shared (...) ] [ private (...) ] [ firstprivate (...) ] [ default (...) ] [ untied ] [ final ] [ if (expression) ]
#pragma omp taskwait [ on (…) ] [ noflush ]
• Some devices (opencl, cuda) have their own private physical address space
• Data must be kept consistent between the original address space of the program and the address space of the device
• copy_deps: by default, all data referenced in in, out, and inout clauses is kept consistent
• The copy_in, copy_out, and copy_inout clauses have to be used to specify any other data that needs to be kept consistent
• no_copy_deps is a shorthand to specify that, for each in/out/inout declaration, there is no need to copy the data
• Tasks on the program host device (smp) also have to specify directionality to ensure consistency for those arguments referenced in some other device(s)
• The default taskwait semantics is to ensure consistency of all the data in the original program address space
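A small host-side task makes these rules concrete. The sketch below uses the directive syntax shown earlier; `axpy_task` and `run_axpy_twice` are hypothetical names chosen for illustration. A standard C compiler without OmpSs support ignores the pragmas, in which case the two calls simply run one after the other.

```c
/* copy_deps: the data named in the in/inout clauses is what the runtime keeps consistent. */
#pragma omp target device(smp) copy_deps
#pragma omp task in([n] x) inout([n] y)
void axpy_task(double a, const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hypothetical driver: two dependent tasks followed by a wait. */
void run_axpy_twice(double a, const double *x, double *y, int n)
{
    axpy_task(a, x, y, n);   /* T1: reads x, updates y */
    axpy_task(a, x, y, n);   /* T2: depends on T1 through y */
    #pragma omp taskwait     /* after this point y is consistent in host memory */
}
```

Because both tasks declare `y` as inout, the second one cannot start before the first finishes, and the default taskwait guarantees that the caller sees the final values.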
Heterogeneity: the copy clauses
• copy_in: requests a consistent copy of the data to be available in the device address space when the task is executed
  – It may require a data transfer, automatically performed by the runtime and scheduled at its convenience
  – Data may already be on the device
  – The runtime keeps track of all data replication and ensures consistency
  – Address translation may be needed
• copy_out: specifies that the task produces that data
  – The runtime has to use this information to ensure data consistency
  – May require a transfer sometime after the task completes, automatically performed by the runtime at its convenience
• copy_inout: specifies that a consistent copy of the data is required on input and a new value is produced for which consistency has to be maintained
• ndrange: provides the configuration for the OpenCL/CUDA kernel
    ndrange ( ndim, {global|grid}_array, {local|block}_array )
    ndrange ( ndim, {global|grid}_dim1, …, {local|block}_dim1, … )
  – 1 to 3 dimensions are valid
  – Values can be provided through 1-, 2-, or 3-element arrays (global, local), or through two lists of 1, 2, or 3 elements, matching the number of dimensions
  – Values can be function arguments or globally accessible variables
• file: provides the file containing the OpenCL/CUDA source code for the specific kernel
    file ( filename )
  – Can have several kernels per file
  – Kernel selected by the function name
• name: names the kernel to be invoked
    name ( opencl-kernel-name )
  – Useful in FORTRAN for capitalization of names
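To see what an ndrange specification implies for the launch, the sketch below derives the number of work-groups (or CUDA blocks) per dimension from the global and local sizes. It assumes the global size is rounded up to a multiple of the local size; how the runtime actually rounds is not stated here, and `launch_config` and `make_launch_config` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical launch description derived from ndrange(ndim, global..., local...). */
typedef struct {
    size_t groups[3];         /* work-groups (blocks) per dimension */
    size_t padded_global[3];  /* global size rounded up to a multiple of local */
} launch_config;

/* Returns 0 on success, -1 if ndim is outside 1..3 or a local size is zero. */
int make_launch_config(int ndim, const size_t *global, const size_t *local,
                       launch_config *cfg)
{
    if (ndim < 1 || ndim > 3)
        return -1;
    for (int d = 0; d < 3; d++) {
        cfg->groups[d] = 1;          /* unused dimensions collapse to 1 */
        cfg->padded_global[d] = 1;
    }
    for (int d = 0; d < ndim; d++) {
        if (local[d] == 0)
            return -1;
        cfg->groups[d] = (global[d] + local[d] - 1) / local[d];  /* ceiling division */
        cfg->padded_global[d] = cfg->groups[d] * local[d];
    }
    return 0;
}
```

Under this assumption, 1000 elements with a local size of 512 give two groups and 1024 work-items, so a kernel needs a bounds check on its global index to skip the 24 extra ones.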
#define BLOCK_SIZE 16
__constant int BL_SIZE = BLOCK_SIZE;

#pragma omp target device(opencl) copy_deps \
        ndrange(2, NB, NB, BL_SIZE, BL_SIZE) file(muld.cl)
#pragma omp task in([NB*NB]A, [NB*NB]B) inout([NB*NB]C)
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB);

void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
            REAL **tileA, REAL **tileB, REAL **tileC)
{
    int i, j, k;
    for (i = 0; i < mDIM; i++)
        for (k = 0; k < lDIM; k++)
            for (j = 0; j < nDIM; j++)
                Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB,
                     tileC[i*nDIM+j], NB);
}
Use __global for copy_in/copy_out arguments
[Figure: matrix of dimension DIM divided into tiles of dimension NB]
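The tile indexing in matmul can be checked on the host with a plain C stand-in for the kernel. This is a sketch under assumptions: `muld_host` and `matmul_tiles` are hypothetical names, tiles are stored row-major, and the kernel is taken to accumulate into C (C += A * B), which is what the inout clause on C and the loop over k suggest.

```c
typedef float REAL;

/* Host stand-in for the Muld kernel: C += A * B on one nb x nb row-major tile. */
static void muld_host(const REAL *A, const REAL *B, REAL *C, int nb)
{
    for (int r = 0; r < nb; r++) {
        for (int c = 0; c < nb; c++) {
            REAL acc = C[r * nb + c];
            for (int t = 0; t < nb; t++)
                acc += A[r * nb + t] * B[t * nb + c];
            C[r * nb + c] = acc;
        }
    }
}

/* Same loop nest and tile indexing as matmul(), with one call per tile product. */
void matmul_tiles(int mDIM, int lDIM, int nDIM, int nb,
                  REAL **tileA, REAL **tileB, REAL **tileC)
{
    for (int i = 0; i < mDIM; i++)
        for (int k = 0; k < lDIM; k++)
            for (int j = 0; j < nDIM; j++)
                muld_host(tileA[i * lDIM + k], tileB[k * nDIM + j],
                          tileC[i * nDIM + j], nb);
}
```

Each tile of C receives lDIM partial products, one per value of k, so consecutive calls that touch the same tile of C are serialized through its inout dependence while calls on different tiles are independent.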
#include "matmul_auxiliar_header.h" // defines BLOCK_SIZE

// Device multiplication function
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB)
{
    // Block index, Thread index
    int bx = get_group_id(0);
    int by = get_group_id(1);
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    // Indexes of the first/last sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd = aBegin + wA - 1;
    ...
LGWSIZE, 1) file (julia.cl)
#pragma omp task out(framebuffer[0;jc.window_size[0] * jc.window_size[1]])
__kernel void compute_julia(float muP0, float muP1, float muP2, float muP3,
                            __global uint32_t * framebuffer, struct julia_context jc);

for (i = 0; i < iterations; i++) {
    getCurMu(currMu, morphTimer);
    morphTimer += 0.05f;
    compute_julia(currMu[0], currMu[1], currMu[2], currMu[3],
                  framebuffer[OTHER_FRAME(display_frame)], jc);
    …
copy_in(d_particles[0;number_of_particles]) \
copy_out([size] out) \
ndrange(1, size, MAX_NUM_THREADS)
#pragma omp task out([size] out) \
    in(d_particles[0*size;1*size], \
       d_particles[1*size;2*size], \
       d_particles[2*size;3*size], \
       d_particles[3*size;4*size], \
       d_particles[4*size;5*size], \
       d_particles[5*size;6*size], \
       d_particles[6*size;7*size], \
       d_particles[7*size;8*size])
__kernel void calculate_force_func(int size, float time_interval, int number_of_particles,
                                   __global Particle* d_particles, __global Particle *out,
                                   int first_local, int last_local);

for ( i = 0; i < number_of_particles; i += bs ) {
    calculate_force_func(bs, time_interval, number_of_particles,
                         this_particle_array, &output_array[i], i, i+bs-1);
#pragma omp target device(smp) copy_deps
#pragma omp task out([size] output) in([particles] d_particles)
void calculate_force_smp(int size, float time_interval, int particles,
                         Particle* d_particles, Particle *output,
                         int first_local, int last_local)
{
    for (size_t id = first_local; id <= last_local; id++) {
        Particle* this_particle = output + id - first_local;
        float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
        float total_force_x = 0.0f;
        float total_force_y = 0.0f;
        float total_force_z = 0.0f;
        for (int i = 0; i < particles; i++) {
            if (i != id) {
                // the particle array parameter is d_particles in this version
                calculate_force(d_particles + id, d_particles + i,
                                &force_x, &force_y, &force_z);
                total_force_x += force_x;
                total_force_y += force_y;
                total_force_z += force_z;
            }
        }
        [...]
    }
}
__global Particle* part, __global Particle *output, int first_local, int last_local)
{
    // one work-item per particle; work-items past the end of the range do nothing
    size_t id = get_global_id(0) + first_local;
    if (id > last_local) return;
    __global Particle* this_particle = output + id - first_local;
    float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
    float total_force_x = 0.0f;
    float total_force_y = 0.0f;
    float total_force_z = 0.0f;
    for (int i = 0; i < particles; i++) {
        if (i != id) {
            calculate_force(part + id, part + i, &force_x, &force_y, &force_z);
            total_force_x += force_x;
            total_force_y += force_y;
            total_force_z += force_z;
        }
    }
    […]
// }
}
OpenCL/CUDA device specifics
• The compiler generates a stub function task that invokes the kernel
  – Using the information in the ndrange and file clauses
• There is a host thread in the runtime representing each device
• The task body OpenCL/CUDA code is actually executed by that thread in the host
• That thread launches kernels
  – Compiles, if necessary
  – Creates buffers
  – Sets kernel arguments
  – Invokes kernel
• The runtime does the memory allocation and deallocation on the device as well as the data transfers for copy variables
• Different memory allocation / copy possibilities
Memory allocation / copy
Device memory allocation
    p = malloc (SIZE);
    ....
    #pragma omp target device (opencl) ndrange (1, SIZE, 16)
    #pragma omp task inout ([SIZE] p)
    __kernel void vec_init (__global * p);

Using mapped memory
    p = nanos_opencl_malloc (SIZE);
    ....
    #pragma omp target device (opencl) ndrange (1, SIZE, 16)
    #pragma omp task inout ([SIZE] p)
    __kernel void vec_init (__global * p);
OmpSs: the implements clause
• Example of alternative implementations

scale.c
    #pragma omp target device (smp) copy_deps
    #pragma omp task input ([size] c) output ([size] b)
    void scale_task (double *b, double *c, double scalar, int size)
    {
        int j;
        for (j = 0; j < size; j++)
            b[j] = scalar*c[j];
    }

    #pragma omp target device (cuda) copy_deps implements (scale_task) ndrange(1, size, 512)
    #pragma omp task input ([size] c) output ([size] b)
    void __global__ scale_task_cuda(double *b, double *c, double scalar, int size);

CUDA kernel
    __global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < size)   // guard threads beyond the end of the array
            b[j] = scalar * c[j];
    }
Heterogeneity: the target directive
#pragma omp target device (cuda) copy_deps
#pragma omp task input ([size] c) output ([size] b)
void scale_task_cuda(double *b, double *c, double scalar, int size)
{
    dim3 dimBlock, dimGrid;
    dimBlock.x = 128;
    dimBlock.y = dimBlock.z = 1;
    dimGrid.x = size/128+1;
    scale_kernel<<<dimGrid,dimBlock>>>(size, 1, b, c, scalar);
}
double A[1024], B[1024], C[1024];
double D[1024], E[1024];
A, B have to be transferred to device before task execution
No data transfer. Will execute after T1
A has to be transferred to host. Can be done in parallel with T2
D, E, have to be transferred to GPU. Can be done at the very beginning
#pragma omp target device (smp) copy_deps
#pragma omp task input ([size] c) output ([size] b)
void scale_task (double *b, double *c, double scalar, int size)
{
    for (int j = 0; j < size; j++)
        b[j] = scalar*c[j];
}
Copy D, E back to host
C has to be transferred to GPU. Can be done when T3 finishes
Executed in the host
Kernel launched to the device. b and c are pointers to the memory allocated by the runtime on the device
Heterogeneity: relaxing consistency
main() {
    …
    scale_task_cuda (A, B, 1.0, 1024);    //T1
    scale_task_cuda (B, A, 0.1, 1024);    //T2
    scale_task_nocpin (C, A, 2.0, 1024);  //T3
    scale_task_cuda (D, E, 5.0, 1024);    //T4
    scale_task_cuda (C, B, 3.0, 1024);    //T5
    #pragma omp taskwait noflush
    printf ("%f, %f\n", D[1], E[512]);
    scale_task (E, B, 0.5, 1024);         //T6
    scale_task_nocpin (D, C, 0.5, 1024);  //T7
    #pragma omp taskwait noflush
}
No data transfer. Will execute after T1
A, is not transferred to host. T3 will execute after T1 finishes but with old value of A
D, E, have to be transferred to GPU. Can be done at the very beginning
No copy of A, B, C, D, E back to host
C value updated by T3 has to be transferred to GPU. Can be done when T3 finishes.
#pragma omp target device (smp) no_copy_deps copy_out ([size] b)
#pragma omp task input ([size] c) output ([size] b)
void scale_task_nocpin (double *b, double *c, double scalar, int size)
{
    for (int j = 0; j < size; j++)
        b[j] = scalar*c[j];
}
Will print original unmodified value of D and E
A, B have to be transferred to device before task execution
Copy B and E back to host before execution
No copy of C back to host No copy of D, back to host but new value defined by T7
No copy of consistent instance of input to smp. Definition of new value for output
OmpSs runtime features
• Automatic and configurable overlapping of data transfers and computation
  – In / out / inout transfers with the kernel executions
• Data prefetching
  – Transfer input data to GPU memory well before tasks need the data
• Configurable data cache policy
  – Write-through: after the execution of each task, output data is updated on the host
  – Write-back: only updates host data if needed
  – Nocache: cache is disabled
• Configurable limit on GPU memory usage
• CUBLAS support and context management
  – No need to call cublasInit / cublasShutdown
  – NX_GPUCUBLASINIT
  – CUBLAS functions can also be scheduled to specific streams to enable overlapping
  – Runtime takes care of proper allocation of streams/handles
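The practical difference between the three cache policies is the number of transfers they cause. The sketch below is a deliberately simplified cost model, not a description of Nanos++ internals. It assumes a single GPU and a buffer that starts valid only on the host and is updated (inout) by a chain of GPU tasks, and it models nocache as one copy-in and one copy-out around every task. `cache_policy`, `transfer_count`, and `count_transfers` are hypothetical names.

```c
typedef enum { CACHE_WT, CACHE_WB, CACHE_NOCACHE } cache_policy;

typedef struct {
    int host_to_device;
    int device_to_host;
} transfer_count;

/* Transfers caused by num_tasks GPU tasks updating one buffer in sequence.
   host_reads_result is nonzero if the host needs the final values afterwards. */
transfer_count count_transfers(cache_policy policy, int num_tasks, int host_reads_result)
{
    transfer_count t = {0, 0};
    if (num_tasks <= 0)
        return t;
    switch (policy) {
    case CACHE_WT:      /* write-through: host updated after every task */
        t.host_to_device = 1;
        t.device_to_host = num_tasks;
        break;
    case CACHE_WB:      /* write-back: host updated only if needed */
        t.host_to_device = 1;
        t.device_to_host = host_reads_result ? 1 : 0;
        break;
    case CACHE_NOCACHE: /* no caching: copy in and out around each task (assumed) */
        t.host_to_device = num_tasks;
        t.device_to_host = num_tasks;
        break;
    }
    return t;
}
```

For five chained tasks whose result the host reads at the end, this model gives five device-to-host copies under write-through and one under write-back, which is why write-back tends to pay off when intermediate results stay on the GPU.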
#pragma omp target device(cuda) \
        ndrange(2, img_width, BS, 128, 1) file (perlin.cu)
#pragma omp task out (output[0:BS*rowstride-1])
__global__ void cuda_perlin (pixel output [ ], float time, int j,
                             int rowstride, int img_width, int BS);
OmpSs Perlin Noise
embedded on host files
versioning → checks performance and decides how to use devices
# additional arguments to pass to Nanos++
NX_ARGS="..."

# enabling priority support
NX_ARGS="--schedule-priority"
NX_ARGS="--schedule-smart-priority"

# Caching mechanism
NX_CACHE_POLICY=<wt | wb | nocache>   # cache policy to use

# Generating Paraver traces
NX_INSTRUMENTATION=extrae             # enables instrumentation
EXTRAE_CONFIG_FILE=xmlfile            # sets the XML file for Extrae to use

# Disabling CUDA/OpenCL support
NX_ARGS=" --disable-cuda --disable-opencl "
Execution options (OpenCL)
NX_DISABLEOPENCL=<yes | no>
Execution options (CUDA)
NX_GPUPREFETCH=<yes | no>          # enable prefetch of task data
NX_GPUOVERLAP=<yes | no>           # enable overlapping of transfers
NX_GPUOVERLAP_INPUTS=<yes | no>    #   and computation
NX_GPUOVERLAP_OUTPUTS=<yes | no>
NX_GPUMAXMEM=<M>                   # amount of memory to preallocate
NX_GPUCUBLASINIT=yes               # initialize CUBLAS
!$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
      [NDRANGE (…)] &
      [IMPLEMENTS (function_name)] &
!$OMP TASK [IN (…)] [OUT (…)] [INOUT (…)] [CONCURRENT (…)] [COMMUTATIVE (…)] &
      [PRIORITY (…)] [LABEL (…)] &
      [SHARED (…)] [PRIVATE (…)] [FIRSTPRIVATE (…)] [DEFAULT (…)] &
      [UNTIED] [IF (EXPRESSION)] [FINAL (…)]
      { code or function }
!$OMP TASKWAIT [ON (…)] [NOFLUSH]

The same TASK syntax with DEPEND clauses:
!$OMP TASK [DEPEND(IN: …)] [DEPEND(OUT: ...)] [DEPEND(INOUT: …)] &
      [DEPEND(CONCURRENT: …)] [DEPEND(COMMUTATIVE: …)] &
      [PRIORITY (…)] [LABEL (…)] &
      [SHARED (…)] [PRIVATE (…)] [FIRSTPRIVATE (…)] [DEFAULT (…)] &
      [UNTIED] [IF (EXPRESSION)] [FINAL (EXPRESSION)]
      { code or function }
INTERFACE
!$OMP TASK IN (A, H) OUT(E)
SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E)
INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE
REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE)
END SUBROUTINE CSTRUCTFAC
IND_H = (DIM2_H * (II - 1)) + 1
IND_E = (DIM2_E * (II - 1)) + 1
CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, &
DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1),&
DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE / TASKS) - 1))
END DO
SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
IMPLICIT NONE
INTEGER :: N
DO I=1,N
FILE(kernel.cl) COPY_DEPS
IMPLICIT NONE
!$OMP TASKWAIT
! IN = 160B
! OUT = 80B
VEC1 and VEC2 are sent to the GPU. No output transfer (write back)
input data is already in the GPU. No input/output transfer
RESULTS copied out from the GPU
!$OMP TASKWAIT
!$OMP TASKWAIT
RESULTS copied out from the GPU and removed
!$OMP TASKWAIT NOFLUSH
!$OMP TASKWAIT
FORTRAN example, assuming write-back cache policy
NO data copied out from the GPU
All data already in the GPU. No output transfer
Printed values are -1
CALL PRINT_BUFF(N, RESULTS)
VEC1 and VEC2 are sent to the GPU No output transfer
RESULTS copied to CPU and kept in GPU
input data is already in the GPU No input transfer
RESULT copied out from the GPU and removed from there
• Support for heterogeneous/hierarchical architectures
• Support asynchrony: global synchronization in systems with a large number of nodes is no longer an answer
• Be aware of data locality
• OmpSs is a proposal that enables:
  – Incremental parallelization from existing sequential codes
  – Data-flow execution model that naturally supports asynchrony
  – Nicely integrates heterogeneity and hierarchy
  – Support for locality scheduling
http://pm.bsc.es/ompss
