Computer Sciences Research Dept.
Programming with OmpSs@CUDA/OpenCL
• Contact: [email protected]
• Source code available from http://pm.bsc.es/ompss/
• Data copies to/from device memory
• Manual work scheduling
[Figure: array h in host memory, array devh in device memory]
• Different data sizes due to blocking make the code
confusing
h = (float*) malloc(sizeof(*h)*DIM2_H*nr);
r = cudaMalloc((void**)&devh,sizeof(*h)*nr*DIM2_H);
• Increased options for data overwrite compared to homogeneous
programming
cudaMemcpy(devh,h,sizeof(*h)*nr*DIM2_H,
cudaMemcpyHostToDevice);
Main.c
    // Initialize device, context, and buffers
    ...
    memobjs[1] = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float4) * n, srcB, NULL);
    // create the kernel
    kernel = clCreateKernel(program, "dot_product", NULL);
    // set the args values
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);
    // set work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 1;
    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, local_work_size,
                                 0, NULL, NULL);
    // read results
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n * sizeof(cl_float), dst, 0, NULL, NULL);
    ...
Kernel
    // c holds one float per work-item, matching the n*sizeof(cl_float) read above
    __kernel void dot_product(__global const float4 *a,
                              __global const float4 *b,
                              __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = dot(a[gid], b[gid]);
    }
• Detection of dependencies at runtime
• Automatic data movement
• Thread-pool model
• OpenMP parallel "ignored"
• All threads created on startup
• One of them (SMP) executes main... and tasks
• P-1 workers (SMP) execute tasks
• One representative (SMP to OpenCL/CUDA) per device
• Experimenting with a single representative for N devices
• All get work from a task pool
• Work is labeled with possible "targets"
• Tasks with several targets are scheduled to different devices at
the same time
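The task pool with "targets" can be pictured with a small plain-C model. This is only a sketch under stated assumptions: the names task_t, pick_task and the TARGET_* labels are hypothetical and are not part of the Nanos++ runtime, and the real pool is accessed concurrently, while this model is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical target labels; a task may carry several of them (bitmask). */
enum { TARGET_SMP = 1 << 0, TARGET_CUDA = 1 << 1, TARGET_OPENCL = 1 << 2 };

typedef struct {
    unsigned targets; /* devices for which the task has an implementation */
    int taken;        /* non-zero once some thread has claimed the task */
} task_t;

/* Return the index of the first unclaimed task that the calling thread
 * (an SMP worker or a device representative) is able to run, or -1. */
int pick_task(task_t *pool, size_t n, unsigned worker_kind)
{
    assert(pool != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].taken && (pool[i].targets & worker_kind)) {
            pool[i].taken = 1;
            return (int)i;
        }
    }
    return -1;
}
```

A task labeled TARGET_SMP | TARGET_CUDA can be claimed by whichever kind of thread asks first, which is the effect of having several implementations of the same task.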
OmpSs: memory model
• A single global address space
• The runtime system takes care of the devices/local memories
• SMP machines: no need for extra runtime support
• Distributed/heterogeneous environments
• Versions of the same data can reside on them
• Data consistency ensured by the runtime system
OmpSs: Task Syntax
{code block or function}
Master waits for sons or specific data availability
#pragma omp taskwait [ on (...) ]
Defaults: data is shared
Target directive
  Task implementation for a device
  The compiler configures the kernel with ndrange
  Support for multiple implementations of a task
  Ask the runtime to ensure consistent data is accessible in the
  address space of the device
#pragma omp target device ({ smp | opencl | cuda }) \
        [ implements ( function_name ) ] \
        [ copy_deps | no_copy_deps ] \
        [ copy_in ( array_spec ,... ) ] [ copy_out (...) ] [ copy_inout (...) ] \
        [ ndrange ( dim, … ) ] [ shmem (...) ] [ file ( name ) ] [ name ( name ) ]
#pragma omp task [ in (...) ] [ out (...) ] [ inout (...) ] \
        [ concurrent (...) ] [ commutative (...) ] \
        [ priority ( P ) ] [ label ( name ) ] \
        [ shared (...) ] [ private (...) ] [ firstprivate (...) ] [ default (...) ] \
        [ untied ] [ final ] [ if ( expression ) ]
{code block or function}
Defaults: data is shared; consistency is copy_deps
Possibility to wait for tasks, but not necessarily data transfers
Possibility to wait for specific data, generated from one or several tasks
Target directive
  Task implementation for a device
  The compiler configures the kernel with ndrange
  Support for multiple implementations of a task
  Ask the runtime to ensure consistent data is accessible in the
  address space of the device
#pragma omp target device ({ smp | opencl | cuda }) \
        [ implements ( function_name ) ] \
        [ copy_deps | no_copy_deps ] \
        [ copy_in ( array_spec ,... ) ] [ copy_out (...) ] [ copy_inout (...) ] \
        [ ndrange ( dim, … ) ] [ shmem (...) ] [ file ( name ) ] [ name ( name ) ]
#pragma omp task [ depend(in: array_spec...) ] [ depend(out: ...) ] [ depend(inout: ...) ] \
        [ depend(concurrent: …) ] [ depend(commutative: ...) ] \
        [ priority ( P ) ] [ label ( name ) ] \
        [ shared (...) ] [ private (...) ] [ firstprivate (...) ] [ default (...) ] \
        [ untied ] [ final ] [ if ( expression ) ]
{code block or function}
#pragma omp taskwait [ on (…) ] [ noflush ]
Possibility to wait for tasks, but not necessarily data transfers
Defaults: data is shared; consistency is copy_deps
Possibility to wait for specific data, generated from one or
several tasks
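The in/out/inout (or depend) clauses are what the runtime uses to detect dependencies between tasks. A minimal plain-C sketch of that check follows. It assumes two accesses conflict only when they name the same start address, whereas the real runtime works on array regions, and the names access_t and must_wait are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DIR_IN, DIR_OUT, DIR_INOUT } dir_t;

typedef struct {
    const void *addr; /* start address named in the clause */
    dir_t dir;        /* directionality declared for it */
} access_t;

static int writes(dir_t d) { return d == DIR_OUT || d == DIR_INOUT; }

/* Non-zero if the later task must wait for the earlier one: both touch
 * the same data and at least one of them writes it (RAW, WAR or WAW).
 * Two readers of the same data never force an ordering. */
int must_wait(const access_t *earlier, size_t ne,
              const access_t *later, size_t nl)
{
    assert((earlier != NULL || ne == 0) && (later != NULL || nl == 0));
    for (size_t i = 0; i < ne; i++)
        for (size_t j = 0; j < nl; j++)
            if (earlier[i].addr == later[j].addr &&
                (writes(earlier[i].dir) || writes(later[j].dir)))
                return 1;
    return 0;
}
```

With one task declaring out(A) in(B) and a later one declaring out(B) in(A), the second has to wait for the first; a task that only touches D and E is independent of both.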
• Some devices (opencl, cuda) have their own private physical address
space
• Data must be kept consistent between the original address
space of the program and the address space of the device
• copy_deps: by default, all data referenced in in, out, and inout
clauses is kept consistent
• The copy_in, copy_out, and copy_inout clauses have to be used to
specify any other data that needs to be kept consistent
• no_copy_deps is a shorthand to specify that, for each
in/out/inout declaration, there is no need to copy the data
• Tasks on the host device (smp) also have to specify
directionality to ensure consistency for those arguments referenced
in some other device(s)
• The default taskwait semantics is to ensure consistency of all the
data in the original program address space
Heterogeneity: the copy clauses
• copy_in: requests a consistent copy of the data to be available
in the device address space when the task is executed
  • It may require a data transfer
    • Automatically performed by the runtime
    • Scheduled at its convenience
  • Data may be already in the device
    • The runtime keeps track of all data replication and ensures
    consistency
    • Address translation may be needed
• copy_out: specifies that the task produces that data
  • The runtime has to use this information to ensure data
  consistency
  • May require a transfer sometime after the task completes
    • Automatically performed by the runtime, at its convenience
• copy_inout: specifies that a consistent copy of the data is
required on input and a new
value is produced for which consistency has to be maintained
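The bookkeeping behind the copy clauses can be sketched as a tiny plain-C model for one data region and one device. The names region_t, region_init and the three hook functions are hypothetical, and the transfers are only counted, not performed; the actual runtime data structures are not described in these slides.

```c
#include <assert.h>

/* Where an up-to-date copy of one data region currently lives. */
typedef struct {
    int valid_on_host;
    int valid_on_device;
    int host_to_device; /* transfers counted so far */
    int device_to_host;
} region_t;

void region_init(region_t *r)
{
    assert(r != 0);
    r->valid_on_host = 1;   /* data starts in the program address space */
    r->valid_on_device = 0;
    r->host_to_device = 0;
    r->device_to_host = 0;
}

/* copy_in: a consistent copy must be on the device before the task runs.
 * A transfer happens only if the device copy is missing or stale. */
void before_device_task_copy_in(region_t *r)
{
    if (!r->valid_on_device) {
        r->host_to_device++;
        r->valid_on_device = 1;
    }
}

/* copy_out: the device task produced a new value, so the host copy is
 * stale. No transfer is needed yet. */
void after_device_task_copy_out(region_t *r)
{
    r->valid_on_device = 1;
    r->valid_on_host = 0;
}

/* A host task or a flushing taskwait needs the data on the host. */
void before_host_access(region_t *r)
{
    if (!r->valid_on_host) {
        r->device_to_host++;
        r->valid_on_host = 1;
    }
}
```

In this model copy_inout is the copy_in hook before the task followed by the copy_out hook after it. Two consecutive device tasks reading the same region cost a single host-to-device transfer.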
• ndrange: provides the configuration for the OpenCL/CUDA kernel
    ndrange ( ndim, {global|grid}_array, {local|block}_array )
    ndrange ( ndim, {global|grid}_dim1, … {local|block}_dim1, … )
  – 1 to 3 dimensions are valid
  – values can be provided through
    • 1-, 2-, or 3-element arrays (global, local)
    • two lists of 1, 2, or 3 elements, matching the number of
    dimensions
  – values can be function arguments or globally accessible variables
• file: provides the file containing the OpenCL/CUDA source code for
the specific kernel
    file ( filename )
  – Can have several kernels per file
    • Kernel selected by the function name
• name: names the kernel to be invoked
    name ( opencl-kernel-name )
  – Useful in FORTRAN for capitalization of names
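As an illustration of what an ndrange configuration amounts to, the sketch below derives the number of work-groups (OpenCL) or thread blocks (CUDA) per dimension from the global and local sizes. It assumes the global size is rounded up to a whole number of groups; whether the runtime rounds up or requires an exact multiple is not stated in these slides. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Groups needed in one dimension, rounding up (assumption). */
size_t groups_for(size_t global, size_t local)
{
    assert(local > 0);
    return (global + local - 1) / local;
}

/* Fill groups[0..ndim-1] for ndrange(ndim, global..., local...) and
 * return the total number of groups launched. */
size_t ndrange_groups(int ndim, const size_t *global,
                      const size_t *local, size_t *groups)
{
    assert(ndim >= 1 && ndim <= 3); /* 1 to 3 dimensions are valid */
    size_t total = 1;
    for (int d = 0; d < ndim; d++) {
        groups[d] = groups_for(global[d], local[d]);
        total *= groups[d];
    }
    return total;
}
```

For example, ndrange(1, 1024, 512) gives 2 groups, and a 2-dimensional range of 64 x 64 with 16 x 16 groups gives 4 x 4 = 16 groups. When the global size is not a multiple of the local size, a kernel needs a bounds check such as the "if (id > last_local) return;" guard used in the n-body kernel.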
#define BLOCK_SIZE 16
__constant int BL_SIZE = BLOCK_SIZE;
#pragma omp target device(opencl) copy_deps \
        ndrange(2, NB, NB, BL_SIZE, BL_SIZE) file (muld.cl)
#pragma omp task in([NB*NB]A, [NB*NB]B) \
        inout([NB*NB]C)
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB);
void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
            REAL **tileA, REAL **tileB, REAL **tileC)
{
    int i, j, k;
    for (i = 0; i < mDIM; i++)
        for (k = 0; k < lDIM; k++)
            for (j = 0; j < nDIM; j++)
                Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB,
                     tileC[i*nDIM+j], NB);
}
Use __global for copy_in/copy_out arguments
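To check the tile indexing of the matmul loop nest on the host, a sequential reference can replace the kernel call. The sketch below assumes REAL is float, that tiles are stored row-major, and that Muld accumulates into C (inferred from the inout clause on C and from the loop over k); muld_ref and matmul_ref are hypothetical names.

```c
#include <assert.h>

typedef float REAL; /* assumption: the slides do not show the definition */

/* Sequential stand-in for the Muld kernel on one NB x NB tile:
 * C += A * B, all three tiles stored row-major. */
void muld_ref(const REAL *A, const REAL *B, REAL *C, int NB)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                C[i * NB + j] += A[i * NB + k] * B[k * NB + j];
}

/* Same loop nest and tile indexing as matmul(), one call per tile triple. */
void matmul_ref(int mDIM, int lDIM, int nDIM, int NB,
                REAL **tileA, REAL **tileB, REAL **tileC)
{
    assert(NB > 0);
    for (int i = 0; i < mDIM; i++)
        for (int k = 0; k < lDIM; k++)
            for (int j = 0; j < nDIM; j++)
                muld_ref(tileA[i * lDIM + k], tileB[k * nDIM + j],
                         tileC[i * nDIM + j], NB);
}
```

All calls that update the same tileC[i*nDIM+j] form a chain through the inout dependence, while calls on different C tiles are independent and can run concurrently.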
[Figure: matrix of DIM x DIM divided into tiles of NB x NB]
#include "matmul_auxiliar_header.h" // defines BLOCK_SIZE
// Device multiplication function
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB)
{
    // Block index, Thread index
    int bx = get_group_id(0);
    int by = get_group_id(1);
    int tx = get_local_id(0);
    int ty = get_local_id(1);
    // Indexes of the first/last sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd = aBegin + wA - 1;
    ...
        LGWSIZE, 1) file (julia.cl)
#pragma omp task out(framebuffer[0;jc.window_size[0] * jc.window_size[1]])
__kernel void compute_julia(float muP0, float muP1, float muP2, float muP3,
                            __global uint32_t * framebuffer,
                            struct julia_context jc);
for (i = 0; i < iterations; i++) {
    getCurMu(currMu, morphTimer);
    morphTimer += 0.05f;
    compute_julia(currMu[0], currMu[1], currMu[2], currMu[3],
                  framebuffer[OTHER_FRAME(display_frame)], jc);
    …
        copy_in(d_particles[0;number_of_particles]) \
        copy_out([size] out) \
        ndrange(1, size, MAX_NUM_THREADS)
#pragma omp task out([size] out) \
        in(d_particles[0*size;1*size], \
           d_particles[1*size;2*size], \
           d_particles[2*size;3*size], \
           d_particles[3*size;4*size], \
           d_particles[4*size;5*size], \
           d_particles[5*size;6*size], \
           d_particles[6*size;7*size], \
           d_particles[7*size;8*size])
__kernel void calculate_force_func(int size, float time_interval,
        int number_of_particles, __global Particle* d_particles,
        __global Particle *out, int first_local, int last_local);
for ( i = 0; i < number_of_particles; i += bs ) {
    calculate_force_func(bs, time_interval, number_of_particles,
                         this_particle_array, &output_array[i], i, i+bs-1);
#pragma omp target device(smp) copy_deps
#pragma omp task out([size] output) in([particles] d_particles)
void calculate_force_smp(int size, float time_interval, int particles,
                         Particle* d_particles, Particle *output,
                         int first_local, int last_local)
{
    for (size_t id = first_local; id <= last_local; id++) {
        Particle* this_particle = output + id - first_local;
        float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
        float total_force_x = 0.0f;
        float total_force_y = 0.0f;
        float total_force_z = 0.0f;
        for (int i = 0; i < particles; i++) {
            if (i != id) {
                // the input array is the d_particles parameter
                calculate_force(d_particles + id, d_particles + i,
                                &force_x, &force_y, &force_z);
                total_force_x += force_x;
                total_force_y += force_y;
                total_force_z += force_z;
            }
        }
        [...]
    }
}
        __global Particle* part, __global Particle *output,
        int first_local, int last_local)
{
    size_t id = get_global_id(0) + first_local;
    if (id > last_local)
        return;
    __global Particle* this_particle = output + id - first_local;
    float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
    float total_force_x = 0.0f;
    float total_force_y = 0.0f;
    float total_force_z = 0.0f;
    for (int i = 0; i < particles; i++) {
        if (i != id) {
            calculate_force(part + id, part + i,
                            &force_x, &force_y, &force_z);
            total_force_x += force_x;
            total_force_y += force_y;
            total_force_z += force_z;
        }
    }
    […]
}
OpenCL/CUDA device specifics
• The compiler generates a stub function task that invokes the
kernel
  – Using the information at ndrange and file clauses
• There is a host thread in the runtime representing each device
• The task body OpenCL/CUDA code is actually executed by that thread
in the host
• That thread launches kernels
  – Compiles, if necessary
  – Creates buffers
  – Sets kernel arguments
  – Invokes kernel
• The runtime does the memory allocation and deallocation on the
device as well as the data transfers for copy variables
• Different memory allocation / copy possibilities
Memory allocation / copy
Device memory allocation
    p = malloc (SIZE);
    ....
    #pragma omp target device (opencl) ndrange (1, SIZE, 16)
    #pragma omp task inout ([SIZE] p)
    __kernel void vec_init (__global * p);
Using mapped memory
    p = nanos_opencl_malloc (SIZE);
    ....
    #pragma omp target device (opencl) ndrange (1, SIZE, 16)
    #pragma omp task inout ([SIZE] p)
    __kernel void vec_init (__global * p);
OmpSs: the implements clause
• Example of alternative implementations
#pragma omp target device (smp) copy_deps
#pragma omp task input ([size] c) output ([size] b)
void scale_task (double *b, double *c, double scalar, int size)
{
    int j;
    for (j=0; j < size; j++)
        b[j] = scalar*c[j];
}
#pragma omp target device (cuda) copy_deps implements (scale_task) \
        ndrange(1, size, 512)
#pragma omp task input ([size] c) output ([size] b)
__global__ void scale_task_cuda(double *b, double *c, double scalar, int size);
__global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < size)   // guard in case the launch covers more threads than elements
        b[j] = scalar * c[j];
}
scale.c
Heterogeneity: the target directive
#pragma omp target device (cuda) copy_deps
#pragma omp task input ([size] c) output ([size] b)
void scale_task_cuda(double *b, double *c, double scalar, int size)
{
    dim3 dimBlock, dimGrid;
    dimBlock.x = 128;
    dimBlock.y = dimBlock.z = 1;
    dimGrid.x = size/128+1;
    scale_kernel<<<dimGrid,dimBlock>>>(size, 1, b, c, scalar);
}
double A[1024], B[1024], C[1024];
double D[1024], E[1024];
A, B have to be transferred to device before task execution
No data transfer. Will execute after T1
A has to be transferred to host. Can be done in parallel with
T2
D, E, have to be transferred to GPU. Can be done at the very
beginning
#pragma omp target device (smp) copy_deps
#pragma omp task input ([size] c) output ([size] b)
void scale_task (double *b, double *c, double scalar, int size)
{
    for (int j=0; j < size; j++)
        b[j] = scalar*c[j];
}
Copy D, E back to host
C has to be transferred to GPU. Can be done when T3 finishes
Executed in the host
Kernel launched to the device. b and c are pointers to the
memory
allocated by the runtime on the device
Heterogeneity: relaxing consistency
double A[1024], B[1024], C[1024];
double D[1024], E[1024];
main() {
    …
    scale_task_cuda (A, B, 1.0, 1024);   //T1
    scale_task_cuda (B, A, 0.1, 1024);   //T2
    scale_task_nocpin (C, A, 2.0, 1024); //T3
    scale_task_cuda (D, E, 5.0, 1024);   //T4
    scale_task_cuda (C, B, 3.0, 1024);   //T5
    #pragma omp taskwait noflush
    printf ("%f, %f\n", D[1], E[512]);
    scale_task (E, B, 0.5, 1024);        //T6
    scale_task_nocpin (D, C, 0.5, 1024); //T7
    #pragma omp taskwait noflush
}
No data transfer. Will execute after T1
A, is not transferred to host. T3 will execute after T1 finishes
but with old value of A
D, E, have to be transferred to GPU. Can be done at the very
beginning
No copy of A, B, C, D, E back to host
C value updated by T3 has to be transferred to GPU. Can be done
when T3 finishes.
#pragma omp target device (smp) no_copy_deps copy_out ([size] b)
#pragma omp task input ([size] c) output ([size] b)
void scale_task_nocpin (double *b, double *c, double scalar, int size)
{
    for (int j=0; j < size; j++)
        b[j] = scalar*c[j];
}
Will print original unmodified value of D and E
A, B have to be transferred to device before task execution
Copy B and E back to host before execution
No copy of C back to host No copy of D, back to host but new value
defined by T7
No copy of consistent instance of input to smp. Definition of new
value for output
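The effect of taskwait noflush on what the host sees can be reproduced with a small plain-C model that keeps separate host and device copies of one buffer. It is a sketch, not the runtime's implementation: buffer_t, device_task_fill and taskwait are hypothetical names, and the "device" is just a second array.

```c
#include <assert.h>
#include <string.h>

#define BUF_N 4

typedef struct {
    double host[BUF_N];   /* copy in the program address space */
    double device[BUF_N]; /* copy in the device address space */
    int device_is_newer;  /* set when only the device copy is current */
} buffer_t;

/* A device task with the buffer as output: only the device copy changes. */
void device_task_fill(buffer_t *b, double value)
{
    assert(b != NULL);
    for (int i = 0; i < BUF_N; i++)
        b->device[i] = value;
    b->device_is_newer = 1;
}

/* flush != 0 models a plain taskwait: data is made consistent on the host.
 * flush == 0 models "taskwait noflush": tasks are done, data stays put. */
void taskwait(buffer_t *b, int flush)
{
    assert(b != NULL);
    if (flush && b->device_is_newer) {
        memcpy(b->host, b->device, sizeof b->host);
        b->device_is_newer = 0;
    }
}
```

If the host copy was initialized to -1 and a device task then writes 5, reading the host copy after taskwait(&b, 0) still yields -1, and only taskwait(&b, 1) makes the 5 visible. This is the same situation as the printf that prints the original unmodified values of D and E.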
OmpSs runtime features
• Automatic and configurable overlapping of data transfers and
computation
  • In / out / inout
  • with the kernel executions
• Data prefetching
  • Transfer input data to GPU memory
  • well before tasks need the data
OmpSs runtime features
• Configurable data cache policy
  • Write-through
    • after the execution of each task, output data is updated on the
    host
  • Write-back
    • only updates host data if needed
  • Nocache
    • cache is disabled
• Configurable limit on GPU memory usage
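The difference between the cache policies can be illustrated by counting transfers for one buffer that ntasks consecutive device tasks read and update, with the host reading it once at the end. This is a simplified model under stated assumptions: in particular, that nocache moves the data for every task is an interpretation of "cache is disabled", and the names below are hypothetical.

```c
#include <assert.h>

typedef enum { POLICY_WT, POLICY_WB, POLICY_NOCACHE } policy_t;

/* Device-to-host transfers of the buffer. */
int output_transfers(policy_t p, int ntasks)
{
    assert(ntasks >= 0);
    if (ntasks == 0)
        return 0;
    switch (p) {
    case POLICY_WB:      /* host updated only when it needs the data */
        return 1;
    case POLICY_WT:      /* host updated after every task */
    case POLICY_NOCACHE: /* assumption: nothing stays on the device */
        return ntasks;
    }
    return -1; /* unknown policy */
}

/* Host-to-device transfers of the buffer. */
int input_transfers(policy_t p, int ntasks)
{
    assert(ntasks >= 0);
    if (ntasks == 0)
        return 0;
    /* With a cache (wt or wb) the device copy stays valid between tasks. */
    return p == POLICY_NOCACHE ? ntasks : 1;
}
```

For five chained tasks this model gives 5 output transfers with write-through and 1 with write-back, which is why write-back pairs well with taskwait noflush when intermediate results are not needed on the host.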
OmpSs runtime features
• CUBLAS support and context management
  • No need to call cublasInit / cublasShutdown
    • NX_GPUCUBLASINIT
  • CUBLAS functions can also be scheduled to specific streams to
  enable overlapping
    • Runtime takes care of proper allocation of streams/handles
#pragma omp target device(cuda) \
        ndrange(2, img_width, BS, 128, 1) file (perlin.cu)
#pragma omp task out (output[0:BS*rowstride-1])
__global__ void cuda_perlin (pixel output[], float time, int j,
                             int rowstride, int img_width, int BS);
OmpSs Perlin Noise
embedded on host files
versioning → checks performance and decides how to use
devices
# additional arguments to pass to Nanos++
NX_ARGS="..."
# enabling priority support
NX_ARGS="--schedule-priority"
NX_ARGS="--schedule-smart-priority"
# Caching mechanism
NX_CACHE_POLICY=<wt | wb | nocache>   # cache policy to use
# Generating Paraver traces
NX_INSTRUMENTATION=extrae             # enables instrumentation
EXTRAE_CONFIG_FILE=xmlfile            # sets the XML file for Extrae to use
# Disabling CUDA/OpenCL support
NX_ARGS=" --disable-cuda --disable-opencl "
Execution options (OpenCL)
NX_DISABLEOPENCL=<yes | no>
Execution options (CUDA)
NX_GPUPREFETCH=<yes | no>          # enable prefetch of task data
NX_GPUOVERLAP=<yes | no>           # enable overlapping of transfers and computation
NX_GPUOVERLAP_INPUTS=<yes | no>
NX_GPUOVERLAP_OUTPUTS=<yes | no>
NX_GPUMAXMEM=<M>                   # amount of memory to preallocate
NX_GPUCUBLASINIT=yes               # initialize CUBLAS
        [NDRANGE (…)] &
        [IMPLEMENTS (function_name)] &
!$OMP TASK [IN (…)] [OUT (…)] [INOUT (…)] [CONCURRENT (…)] [COMMUTATIVE (…)] &
        [PRIORITY (…)] [LABEL (…)] &
        [SHARED (…)] [PRIVATE (…)] [FIRSTPRIVATE (…)] [DEFAULT (…)] &
        [UNTIED] [IF (EXPRESSION)] [FINAL (…)]
{ code or function }
!$OMP TASKWAIT [ON (…)] [NOFLUSH]
!$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
        [NDRANGE (…)] &
        [IMPLEMENTS (function_name)] &
!$OMP TASK [DEPEND(IN: …)] [DEPEND(OUT: …)] [DEPEND(INOUT: …)] &
        [DEPEND(CONCURRENT: …)] [DEPEND(COMMUTATIVE: …)] &
        [PRIORITY (…)] [LABEL (…)] &
        [SHARED (…)] [PRIVATE (…)] [FIRSTPRIVATE (…)] [DEFAULT (…)] &
        [UNTIED] [IF (EXPRESSION)] [FINAL (EXPRESSION)]
{ code or function }
INTERFACE
!$OMP TASK IN (A, H) OUT(E)
SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE,
E)
INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE
REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE)
END SUBROUTINE CSTRUCTFAC
IND_H = (DIM2_H * (II - 1)) + 1
IND_E = (DIM2_E * (II - 1)) + 1
CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, &
DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1),&
DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE / TASKS) - 1))
END DO
SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
IMPLICIT NONE
INTEGER :: N
DO I=1,N
FILE(kernel.cl) COPY_DEPS
IMPLICIT NONE
!$OMP TASKWAIT
! IN = 160B
! OUT = 80B
VEC1 and VEC2 are sent to the GPU. No output transfer (write
back)
input data is already in the GPU. No input/output transfer
RESULTS copied out from the GPU
IMPLICIT NONE
INTEGER :: N
DO I=1,N
IMPLICIT NONE
!$OMP TASKWAIT
!$OMP TASKWAIT
RESULTS copied out from the GPU and removed
VEC1 and VEC2 are sent to the GPU. No output transfer (write
back)
IMPLICIT NONE
INTEGER :: N
DO I=1,N
IMPLICIT NONE
!$OMP TASKWAIT NOFLUSH
!$OMP TASKWAIT
FORTRAN example, assuming write-back cache policy
VEC1 and VEC2 are sent to the GPU. No output transfer (write
back)
NO data copied out from the GPU
All data already in the GPU. No output transfer
RESULTS copied out from the GPU and removed
Printed values are -1
FILE(kernel.cl) COPY_DEPS
IMPLICIT NONE
CALL PRINT_BUFF(N, RESULTS)
CALL PRINT_BUFF(N, RESULTS)
FORTRAN example, assuming write-back cache policy
VEC1 and VEC2 are sent to the GPU No output transfer
RESULTS copied to CPU and kept in GPU
RESULTS copied to CPU and kept in GPU
input data is already in the GPU No input transfer
RESULT copied out from the GPU and removed from there
• Support for heterogeneous/hierarchical architectures
• Support asynchrony: global synchronization in systems with a large
number of nodes is not an answer anymore
• Be aware of data locality
• OmpSs is a proposal that enables:
  • Incremental parallelization from existing sequential codes
  • Data-flow execution model that naturally supports asynchrony
  • Nicely integrates heterogeneity and hierarchy
  • Support for locality scheduling
http://pm.bsc.es/ompss