Task-Based Programming with OmpSs and its Application
Facultad de Informática, UCM
Madrid, 4 November 2014
Outline
Motivation
StarSs and OmpSs basics
OmpSs flavors
OmpSs environment
eDSLs on top of OmpSs
Conclusions
Exascale challenge, or how to make it the HPC comfort zone
The Learning Zone model establishes a theory of how performance of a person can be enhanced and their skills optimized
– Comfort Zone: feel comfortable and do not have to take any risks
– Learning Zone: just outside of our secure environment, we grow and learn
– Panic Zone: all our energy is used up for managing/controlling our anxiety and no energy can flow into learning.
Moving into the learning zone extends the comfort zone and pushes it towards the panic zone
When following a personal dream or vision, individuals need to move to the learning zone and take controlled risks, in order to achieve the challenges of their panic zone
Social pedagogy
* The Learning Zone Model (Senninger, 2000)
Exascale poses different challenges to HPC … away from the current comfort zone … maybe in the panic zone?
The parallel programming comfort zone
State of the art parallel programming (the comfort zone)
– Where to place data
– What to run where
– How to communicate
– Static scheduling, all decisions controlled by the programmer

Parallel programming in the future (the panic? zone)
– What do I need to compute
– What data do I need to use
– Hints (not necessarily very precise) on potential concurrency, locality, …
– Dynamic scheduling, optimizations decided by the runtime, loss of control by the programmer
Parallel programming evolution
At the beginning there was one language
– Simple interface: sequential program, ILP exploited below the ISA / API
– Programs "decoupled" from hardware
[Figure: Applications layered on top of the ISA / API]
Parallel programming evolution
Multicores and heterogeneous processors made the interface leak
– ISA / API now exposes address spaces (hierarchy, transfers), specific instructions, …
– Applications end up mixing program logic + platform specificities
BSC vision in programming
Need to decouple again
– General purpose, task based, single address space
– "Reuse" architectural ideas under new constraints
– Applications carry only the program logic, architecture independent
– Power to the runtime
– PM: high-level, clean, abstract interface on top of the ISA / API
BSC vision in programming
Special purpose
– Must be easy to develop/maintain
– Fast development, more expressivity
– Applications written in DSLs (DSL1, DSL2, DSL3, …)
– Power to the runtime
– PM: high-level, clean, abstract interface on top of the ISA / API
StarSs basic idea
Sequential application:

...
for (i = 0; i < N; i++) {
   T1 (data1, data2);
   T2 (data4, data5);
   T3 (data2, data5, data6);
   T4 (data7, data8);
   T5 (data6, data8, data9);
}
...
[Figure: task graph (T1_0 … T5_0, T1_1 … T5_1, T1_2, …) created from data precedences and scheduled onto Resource 1 … Resource N]
Write: task selection + parameter directions (input, output, inout)
Execute: scheduling, data transfers, task execution; synchronization, results transfer
Parallel resources (multicore, GPU, cluster, cloud, grid)
Decouple how we write from how it is executed
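As a sketch of what the annotations could look like for the loop above (a minimal sketch; the direction of each argument and the block size M are assumptions, chosen so that T1 and T2 feed T3, and T3 and T4 feed T5, matching the task graph):

#define M 1024   // hypothetical block size

#pragma omp task in([M]data1) out([M]data2)
void T1 (float *data1, float *data2);
#pragma omp task in([M]data4) out([M]data5)
void T2 (float *data4, float *data5);
#pragma omp task in([M]data2, [M]data5) out([M]data6)
void T3 (float *data2, float *data5, float *data6);
#pragma omp task in([M]data7) out([M]data8)
void T4 (float *data7, float *data8);
#pragma omp task in([M]data6, [M]data8) out([M]data9)
void T5 (float *data6, float *data8, float *data9);

The loop body itself is unchanged; each call creates a task, and the graph is derived automatically from these directions.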
OmpSs vs OpenMP

OpenMP 3.0 includes tasks (2008)
– No dependencies
OpenMP 4.0 (2013) includes
– Task dependencies
  • Overlapped or strided regions not supported
– Support for accelerators
  • Static support for the device, without integration with dynamic scheduling
  • Based on compilation from C
OmpSs supports task dependencies
– Main feature
OmpSs support for accelerators
– Leveraging CUDA, OpenCL
– Integrated in the dynamic scheduling
– Support for multiple devices, automatic data transfers
– Support for versioning
Other OmpSs features
– Support for overlapped/strided regions
– Concurrent/Commutative clauses
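For reference, a minimal sketch of how the same dependence is expressed in both models (the task foo and the sizes are hypothetical):

// OpenMP 4.0: depend clause, contiguous array sections only
#pragma omp task depend(in: a[0:N]) depend(inout: b[0:N])
foo(a, b, N);

// OmpSs: in/inout clauses; the dependence expressions may also be
// strided or overlapped regions (see the array-section slides below)
#pragma omp task in(a[0;N]) inout(b[0;N])
foo(a, b, N);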
OmpSs: task dependencies

int main (int argc, char **argv)
{
   int i, j, k;
   …
   initialize(A, B, C);
   for (i = 0; i < NB; i++)
      for (j = 0; j < NB; j++)
         for (k = 0; k < NB; k++)
            matmul_tile (A[i][k], B[k][j], C[i][j], BS);
}

#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
void matmul_tile (float *A, float *B, float *C, int BS)
{
   int i, j, k;
   for (i = 0; i < BS; i++)
      for (j = 0; j < BS; j++)
         for (k = 0; k < BS; k++)
            C[i*BS+j] += A[i*BS+k] * B[k*BS+j];
}
OmpSs: defining array sections

int a[N][M];
#pragma omp task in(a[2:3][3:4])       // 2 x 2 subblock of a at a[2][3]
#pragma omp task in(a[1:2][0:M-1])     // rows 1 and 2
#pragma omp task in(a[0:N-1][0:M-1])   // whole matrix, used to compute dependences

Equivalent [lower;size] notation:
#pragma omp task in(a[0;N][0;M])       // whole matrix (= a[0:N-1][0:M-1])
#pragma omp task in(a[2;2][3;2])       // 2 x 2 subblock at a[2][3] (= a[2:3][3:4])
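A small illustration of why such regions matter for dependences (a sketch; update_row and its access pattern are hypothetical): consecutive tasks touch overlapping rows, and the runtime orders them through the overlapping elements, something a plain scalar dependence cannot express.

int a[N][M];

// Writes row i, reads rows i-1 and i+1: the regions of task i and task i+1 overlap
#pragma omp task inout(a[i;1][0;M]) in(a[i-1;1][0;M], a[i+1;1][0;M])
void update_row (int i);

for (int i = 1; i < N-1; i++)
   update_row(i);
#pragma omp taskwait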
OmpSs examples: Serialized reduction pattern
for (int j = 0; j < N; j += BS) {
   actual_size = (N-j > BS ? BS : N-j);
   #pragma omp task in(vec[j;actual_size]) inout(result)
   for (int count = 0; count < actual_size; count++, j++)
      result += vec[j];
}
#pragma omp task in(result)
printf ("TOTAL is %d\n", result);
#pragma omp taskwait
[Figure: vec processed in chunks of size BS (last chunk < BS); every task updates result, so execution is serialized]
OmpSs: Concurrent
[Figure: sum tasks over BS-sized chunks of vec run in parallel, with atomic access to the total]
double vec[N];
double result;
for (int j = 0; j < N; j += BS) {
   actual_size = (N-j > BS ? BS : N-j);
   #pragma omp task in(vec[j;actual_size]) concurrent(result)
   {
      double local_result = 0.0;
      for (int count = 0; count < actual_size; count++)
         local_result += vec[j++];
      #pragma omp atomic
      result += local_result;
   }
}
#pragma omp task in(result)
printf ("TOTAL is %f\n", result);
#pragma omp taskwait
OmpSs: Commutative
[Figure: sum tasks over BS-sized chunks of vec; tasks executed out of order, but not concurrently]
for (int j = 0; j < N; j += BS) {
   actual_size = (N-j > BS ? BS : N-j);
   #pragma omp task in(vec[j;actual_size]) commutative(result)
   for (int count = 0; count < actual_size; count++, j++)
      result += vec[j];
}
#pragma omp task in(result)
printf ("TOTAL is %d\n", result);
#pragma omp taskwait
No mutual exclusion required
OmpSs support of ISA heterogeneity
Target directive
– Source code parsing and backend invocation
– The compiler parses the device-specific syntax and hands the code over to the appropriate back-end compiler

#pragma omp target device (smp | cuda | opencl)
– smp
  • Backend compiler: gcc, icc, xlc, …
– cuda
  • Mercurium parses CUDA
  • Backend compiler: nvcc
– opencl
  • Backend compiler selected at runtime

Only the kernel is written in CUDA; the runtime takes care of memory allocation, data transfers, task scheduling, synchronization, …
#pragma omp target device(cuda) copy_deps ndrange(2, NB, NB, 16, 16)
#pragma omp task inout([NB*NB]C) in([NB*NB]A, [NB*NB]B)
__global__ void Muld(REAL* A, REAL* B, int wA, int wB, REAL* C, int NB);
OmpSs@CUDA matmul
[Figure: DIM x DIM blocked matrices with NB x NB tiles]
void matmul (int m, int l, int n, int mDIM, int lDIM, int nDIM,
             REAL **tileA, REAL **tileB, REAL **tileC)
{
   int i, j, k;
   for (i = 0; i < mDIM; i++)
      for (k = 0; k < lDIM; k++)
         for (j = 0; j < nDIM; j++)
            Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB, tileC[i*nDIM+j], NB);
}
#include "matmul_auxiliar_header.h" // Thread block size #define BLOCK_SIZE 16 // Device multiplication function called by Mul() // Compute C = A * B // wA is the width of A // wB is the width of B __global__ void Muld(REAL* A, REAL* B, int wA, int wB, REAL* C,
int NB) { // Block index int bx = blockIdx.x; int by = blockIdx.y; // Thread index int tx = threadIdx.x; int ty = threadIdx.y; // Index of the first sub-matrix of A processed by the block int aBegin = wA * BLOCK_SIZE * by; // Index of the last sub-matrix of A processed by the block int aEnd = aBegin + wA - 1; // Step size used to iterate through the sub-matrices of A int aStep = BLOCK_SIZE; …
#define BLOCK_SIZE 16
__constant int BL_SIZE = BLOCK_SIZE;
#pragma omp target device(opencl) copy_deps ndrange(2, NB, NB, BL_SIZE, BL_SIZE)
#pragma omp task in([NB*NB]A, [NB*NB]B) inout([NB*NB]C)
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB, __global REAL* C, int NB);
OmpSs@OpenCL matmul

[Figure: DIM x DIM blocked matrices with NB x NB tiles]
void matmul (int m, int l, int n, int mDIM, int lDIM, int nDIM,
             REAL **tileA, REAL **tileB, REAL **tileC)
{
   int i, j, k;
   for (i = 0; i < mDIM; i++)
      for (k = 0; k < lDIM; k++)
         for (j = 0; j < nDIM; j++)
            Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB, tileC[i*nDIM+j], NB);
}

#include "matmul_auxiliar_header.h"   // defines BLOCK_SIZE
// Device multiplication function: compute C = A * B
// wA is the width of A, wB is the width of B
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB,
                   __global REAL* C, int NB)
{
   // Block index, thread index
   int bx = get_group_id(0);
   int by = get_group_id(1);
   int tx = get_local_id(0);
   int ty = get_local_id(1);
   // Indexes of the first/last sub-matrix of A processed by the block
   int aBegin = wA * BLOCK_SIZE * by;
   int aEnd = aBegin + wA - 1;
   // Step size used to iterate through the sub-matrices of A
   int aStep = BLOCK_SIZE;
   ...
Use __global for copy_in/copy_out arguments
OmpSs: support for multiple versions

int main (int argc, char **argv)
{
   int i, j, k;
   …
   initialize(A, B, C);
   for (i = 0; i < NB; i++)
      for (j = 0; j < NB; j++)
         for (k = 0; k < NB; k++)
            matmul_tile (A[i][k], B[k][j], C[i][j], BS);
}

#pragma omp target device(smp) copy_deps
#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
void matmul_tile (float *A, float *B, float *C, int BS)
{
   int i, j, k;
   for (i = 0; i < BS; i++)
      for (j = 0; j < BS; j++)
         for (k = 0; k < BS; k++)
            C[i*BS+j] += A[i*BS+k] * B[k*BS+j];
}

#pragma omp target device(cuda) copy_deps implements(matmul_tile)
#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
void matmul_tile_cuda (float *A, float *B, float *C, int BS)
{
   int hA, wA, wB;
   hA = NB; wA = NB; wB = NB;
   dim3 dimBlock, dimGrid;
   dimBlock.x = BS; dimBlock.y = BS;
   dimGrid.x = (wB / dimBlock.x);
   dimGrid.y = (hA / dimBlock.y);
   Muld <<<dimGrid, dimBlock>>> (A, B, wA, wB, C);
}

#pragma omp target device(opencl) copy_deps implements(matmul_tile) ndrange(2, NB, NB, BL_SIZE, BL_SIZE)
#pragma omp task inout([BS*BS]C) in([BS*BS]A, [BS*BS]B)
__kernel void Muld(__global REAL* A, __global REAL* B, int wA, int wB, __global REAL* C, int BS);
OmpSs: support for multiple versions

[Figure: results showing task versions and data transfers]
OmpSs @ Cluster
void fft_round (long N_SQRT, long FFT_BS, fftw_complex (*A)[N_SQRT][N_SQRT],
                fftw_complex (*B)[N_SQRT][N_SQRT], char *plan, size_t plan_size)
{
   long innerBs = (FFT_BS / _TARGET_THDS);
   long restInnerBs = (FFT_BS % _TARGET_THDS);
   for (long J = 0; J < N_SQRT; J += FFT_BS) {
      #pragma omp target device(smp) copy_deps
      #pragma omp task firstprivate(N_SQRT, FFT_BS, J, innerBs, restInnerBs) \
                       inout((*A)[J;FFT_BS][0;N_SQRT]) in([plan_size] plan)
      {
         ...
         fftw_complex (*b)[N_SQRT][N_SQRT] = malloc(N_SQRT * FFT_BS * sizeof(fftw_complex));
         for (long i = J; i < J+FFT_BS; i = i + (innerBs + ((((i-J)/myInnerBs) < restInnerBs) ? 1 : 0))) {
            #pragma omp task firstprivate(myN_SQRT, i, J, myInnerBs, my_plan, myRestInnerBs)
            {
               for (long j = i; j < (i + (myInnerBs + (((i-J)/myInnerBs) < myRestInnerBs ? 1 : 0))) && j < myN_SQRT; j++) {
                  HPCC_zfft1d(my_plan->n, &(*myA1)[j][0], &(*b)[j-J][0], -1, my_plan);
               }
            }
         }
         #pragma omp taskwait noflush
         free(b);
      }
   }
}
➔ Focus on support for distributed architectures
➔ Same code, with nesting better suited to the hierarchy
Hybrid MPI/OmpSs
Overlap communication and computation; extend the asynchronous data-flow execution to the outer level
➔ Focus on adoption by the plethora of existing MPI codes
…
for (k = 0; k < N; k++) {
   if (mine) {
      Factor_panel(A[k]);
      send(A[k]);
   } else {
      receive(A[k]);
      if (necessary) resend(A[k]);
   }
   for (j = k+1; j < N; j++)
      update(A[k], A[j]);
}
…

#pragma omp task inout(A[SIZE])
void Factor_panel(float *A);
#pragma omp task in(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);
#pragma omp task in(A[SIZE])
void send(float *A);
#pragma omp task out(A[SIZE])
void receive(float *A);
#pragma omp task in(A[SIZE])
void resend(float *A);
[Figure: tasks and communications distributed across processes P0, P1, P2]
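A possible body for the communication tasks (a sketch; neighbour and PANEL_TAG are hypothetical, and the real code would handle several destinations): wrapping the blocking MPI calls in tasks lets the runtime execute update tasks of other panels while a transfer is still in flight.

#pragma omp task in(A[SIZE])
void send(float *A)
{
   // Blocking send inside a task: other ready tasks keep the cores busy meanwhile
   MPI_Send(A, SIZE, MPI_FLOAT, neighbour, PANEL_TAG, MPI_COMM_WORLD);
}

#pragma omp task out(A[SIZE])
void receive(float *A)
{
   MPI_Recv(A, SIZE, MPI_FLOAT, MPI_ANY_SOURCE, PANEL_TAG,
            MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}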
Dynamic Load Balancing: MPI/OmpSs + LeWI
Automatically achieved by the runtime
– Load balance within the node
– Fine grain
– Complementary to application-level load balance
– Leverages OmpSs malleability

LeWI: Lend When Idle
– An MPI process lends its CPUs when inside a blocking MPI call
– Another MPI process on the same node can use the lent CPUs to run with more threads
– When the MPI call finishes, the MPI process retrieves its CPUs
[Figure: timelines of MPI 0 and MPI 1 for an unbalanced application, without and with LeWI; while one process waits inside an MPI call, its CPUs are used by the other]
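A minimal sketch of the situation LeWI exploits (the variables and the work partition are hypothetical); note that nothing LeWI-specific appears in the source, the lending happens inside the blocking MPI call:

// Unbalanced step: my_num_blocks differs between the ranks on a node
for (int b = 0; b < my_num_blocks; b++) {
   #pragma omp task inout(data[b*BS;BS])
   compute_block(&data[b*BS], BS);
}
#pragma omp taskwait

// The rank that finishes first blocks here; with LeWI its idle CPUs are
// lent to the other rank on the node, which runs its remaining tasks
// with more threads until the collective completes.
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);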
OmpSs infrastructure: Mercurium Compiler
Recognizes the constructs and transforms them into calls to the runtime
Manages code restructuring for the different target devices
– Device-specific handlers
– May generate code in a separate file
– Invokes the different back-end compilers
  • gcc, icc, xlc, … for regular code
  • nvcc for NVIDIA
Input languages: C/C++/Fortran
OmpSs infrastructure: The NANOS++ Runtime
Nanos++
– Common execution runtime (C, C++ and Fortran)
– Target-specific features
– Task creation, dependency management, resilience, …
– Task scheduling (BF, Cilk, Priority, Socket, …)
– Data management: unified directory/cache architecture
  • Transparently manages separate address spaces (host, device, cluster) …
  • … and the data transfers between them
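As a minimal sketch of what the directory/cache buys the programmer (hypothetical saxpy/checksum tasks, in the copy_deps style used above): the CUDA task leaves the valid copy of y on the device, and Nanos++ copies it back automatically before the host task runs.

// The runtime allocates device memory and copies x and y in (copy_deps)
#pragma omp target device(cuda) copy_deps ndrange(1, n, 128)
#pragma omp task in([n]x) inout([n]y)
__global__ void saxpy(int n, float a, float *x, float *y);

// The directory records that the valid copy of y is on the GPU and
// transfers it back before this task executes on a host core
#pragma omp target device(smp) copy_deps
#pragma omp task in([n]y)
void checksum(int n, float *y);

void run(int n, float a, float *x, float *y)
{
   saxpy(n, a, x, y);    // executed on the GPU
   checksum(n, y);       // executed on the host, after the automatic transfer
   #pragma omp taskwait
}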
OmpSs behaviour
[Figure: compilation and execution flow. Mercurium (C/C++ source-to-source compiler) transforms the annotated source (#pragma omp task …) into host code with runtime calls (createWD(…), wait_completion()) plus device code; the native compilers (gcc, nvcc, …) produce the application binary. At run time the Nanos++ runtime performs the scheduling and keeps a data directory while executing on SMP cores, GPGPUs and remote cluster nodes]
Some results: OmpSs @ SMP
2x Intel SandyBridge-EP E5-2670/1600 8-core at 2.6 GHz
OmpSs @ Cluster
FFT performance (16k x 16k complex elements)
Peak performance on par with the MPI implementation
MPI/OmpSs
ScaLAPACK: Cholesky factorization
– An example of the issues in porting legacy code
– A demonstration that it is feasible
– Synchronization tasks to emulate array-section behavior
  • The overhead is more than compensated by the flexibility
– The importance of scheduling
  • ~10% difference in global performance
– Some difficulties with legacy codes
  • Structure of the sequential code
  • Memory allocation
What is a DSL?
Domain Specific Language
– A language tailored to solve problems in one domain
– The size of the domain can vary widely
  • Data query (SQL)
  • Numerical computing (Matlab)
  • Statistics (R)
What is a DSL for HPC?
Domain Specific Language
– A language tailored to solve problems in one domain
– The size of the domain can vary widely
  • Data query (SQL)
  • Numerical computing (Matlab)
  • Statistics (R)
– The DSL has additional performance requirements
  • To solve "interesting" problems it must run efficiently on an HPC system
DSL advantages & drawbacks
✓ Language very close to the problem domain
– Best programmer productivity
  • Easy to understand by domain experts, even without previous knowledge of the language!
  • Easy to map and solve the domain problem
  • Easy to maintain and future-proof!
– Language fully decoupled from the hardware
– Bad/Good/Best performance

✗ The development of a DSL is only justified when there is a large community behind it
– Otherwise, there is no way to amortize the development cost of the DSL infrastructure
✗ The complexity of developing a DSL is high
– DSL compiler, tools, ...
✗ The complexity of developing an HPC DSL is huge!
– DSL compiler, tools, optimizer, distributed parallel runtime system, ...
BSC goal – CS department
Develop a framework that can be shared by several DSLs
– Compiler framework
  • Scala
  • Lightweight Modular Staging (LMS) from EPFL
  • Dataflow-superscalar framework DFL from BSC
– Runtime framework
  • OmpSs (Mercurium & Nanox++)
  • OpenCL
  • MPI (future work)
BSC - CS / CASE collaboration

BSC - CASE expertise on Partial Differential Equations and HPC
– Alya Red simulation environment

Domain: Convection-Diffusion-Reaction equations
– Well-known domain (by the CASE people)
– Several implementations already available in C and Fortran
– First design decisions of the DSL
  • Level of abstraction
  • Types
  • Operators
SAIPH: a DSL for solving CDR equations

Simple and high-level syntax
– High-level constructs that map directly to domain knowledge
– Efficient development/maintenance cycle
High performance computing for free (for the end user)
– Ability to solve large, complex problems with ~20 lines of clean, simple code
CDR: Example 1 – Pure diffusion phenomena

def KFun(xp: Float, yp: Float, zp: Float) = {
  if (zp > 18.75) 0.02 else 0.15
}
val c = Cartesian(12.5, 25.0, 37.5)
val temp = Unknown(c)
val plane = Dirichlet(lowXZ of c, temp, 400)
val hv = Vector(0.5, 0.5, 0.5)
val pre = PreProcess(nsteps = 100000, deltaT = 0.125, h = hv)(plane)
val K = KFun _
val diffusion = K * lapla(temp) - dt(temp)
val post = snapshoot each 100 steps
solve(pre)(post) equation diffusion to "diffusion"

Runs on a system with a GPU: 10,000 time steps in 7 seconds
CDR: Example 1 – Pure diffusion phenomena
Underlying Technologies
Front end – compile the program with the LMS library and the compiler implementation together
Middle end – 1st stage – domain-specific optimizations – LMS IR generation
Back end – 2nd stage – DFL code + OpenCL kernels

[Figure: compilation pipeline. The Scala Virtualized Compiler turns Diffusion.sph into Diffusion.class; the CDRs embedded compiler (LMS) emits Diffusion.dfl and DiffusionEquation.rsveq; the DFL compiler (LMS) does the host-side code generation to Diffusion.cpp on top of OmpSs, and the equation stencil compiler (LMS) does the accelerator-side code generation to DiffusionKernels.cl]
CDR: Example 1 – Pure diffusion phenomena
CDRs generates
– Two OpenCL kernels (tasks)
– One I/O task
– The initialization code + the body of the application + the OmpSs pragmas
The OmpSs runtime orchestrates the execution
– Schedules tasks based on data dependencies
– Manages data transfers between host and GPU
[Figure: execution trace showing the input/output tasks and the GPU computation tasks]
Translation process

Diffusion.sph:

def KFun(xp: Float, yp: Float, zp: Float) = {
  if (zp > 18.75) 0.02 else 0.15
}
// Defining mesh and conditions
val c = Cartesian(12.5, 25.0, 37.5)
val temp = Unknown(c)
val plane = Dirichlet(lowXZ of c, temp, 400)
// Defining preprocess
val hv = Vector(0.5, 0.5, 0.5)
val pre = PreProcess(nsteps = 150000, deltaT = 0.125, h = hv)(plane)
// Defining equation
val K = EqField(KFun _)
val diffusion = K * lapla(temp) - dt(temp)
// Defining postprocess
val post = snapshoot each 5000 steps
solve(pre)(post) equation diffusion to "diffusion"
Diffusion.dfl (generated):

...
val solveStepx31 = Kernel(kc_x31, "solveStepx31")(In, In(3), In, In, In, In, In(13),
  InOut(x23), InOut(x23), In(x23), In(x23), In(6), In(6))
val expandBounds = Kernel(kc_x31, "expandBounds")(In, In, In, InOut(x23), In(6), In)
...
(4 until x26) foreach { i =>
  (4 until 5) foreach { j =>
    (4 until x25) foreach { k =>
      x24(i*x17*x13 + j*x13 + k) = 400.0000000000f
    }
  }
}
...
(0 until 150000) foreach { i_x31 =>
  if (i_x31 % 2 == 0) {
    solveStepx31(0.1250000000f, x0, x13, x17, x21, 4, coeffs_x31, x24, x24_back_1,
      x29, x31_dirich_mask_unk0, x31_neumann_mask_unk0, x31_neumann_vals_unk0) using ndr_x31
    expandBounds(x13, x17, x21, x24, x31_periodics, 4) using ndr_x31
    if ((i_x31+1) % 5000 == 0) {
      ()
      Task(x24, x0)(In(x23), In(3)) {
        writeVTI(x24, x13, x17, x21, "diffusion", x0, 4, (i_x31+1)/5000)
      }
    }
  }
  ...
}
taskwait
Diffusion.cl (generated):

__kernel void solveStepx31(float dt, __global float *H, int dx, int dy, int dz, int halo,
    __global float *coeffs, __global float *unk0_0, __global float *unk0_1,
    __global float *field0, __global int *dirich_mask0, __global int *neumann_mask0,
    __global float *neumann_vals0)
{
  int i = get_global_id(2);
  int j = get_global_id(1);
  int k = get_global_id(0);
  if (i < halo || j < halo || k < halo || i >= (dz-halo) || j >= (dy-halo) || k >= (dx-halo))
    return;
  int neum0DerType = 0;
  int neum0Direction;
  float neum0Value;
  if (i == halo) {
    if (neumann_mask0[0] > 0) {
      neum0DerType = neumann_mask0[0];
      neum0Direction = 0;
      neum0Value = neumann_vals0[0];
    }
  }
  if (j == halo) ... }
  int idx = i*dx*dy + j*dx + k;
  float x1 = unk0_1[idx];
  float x3 = unk0_1[idx];
  float x2 = field0[idx];
  float x4 = sosd(&unk0_1[idx], H, dx, dy, dz, neum0DerType, neum0Direction, neum0Value);
  float x5 = x2 * x4;
  float x6 = x5 - 0.0f;
  if (dirich_mask0[idx] == 0)
    unk0_0[idx] = unk0_1[idx] + x6*dt;
  else
    unk0_0[idx] = unk0_1[idx];
}

__kernel void expandBounds(int dx, int dy, int dz, __global float *unk, __global int *periodics, int halo)
{
  int i = get_global_id(2);
  int j = get_global_id(1);
  int k = get_global_id(0);
  if (i > (dz-1) || j > (dy-1) || k > (dx-1)) return;
  int idx = i*dx*dy + j*dx + k;
  int di = i; int dj = j; int dk = k;
  if (i < halo) di = periodics[0]; else if (i >= (dz-halo)) di = periodics[3];
  if (j < halo) dj = periodics[1]; else if (j >= (dy-halo)) dj = periodics[4];
  if (k < halo) dk = periodics[2]; else if (k >= (dx-halo)) dk = periodics[5];
  if (i != di || j != dj || k != dk)
    unk[idx] = unk[di*dx*dy + dj*dx + dk];
}
Diffusion.cpp (generated):

...
for (int x126 = 0; x126 < 150000; x126++) {
  int x127 = x126 % 2;
  bool x128 = x127 == 0;
  if (x128) {
    #pragma omp target device(opencl) ndrange(3, 0, 0, 0, x8, x12, x16, 16, 16, 4) copy_deps
    #pragma omp task in([3] xa1, [13] xa6, [x18] xa9, [x18] xa10, [6] xa11, [6] xa12) inout([x18] xa7, [x18] xa8)
    __kernel void solveStepx31(float xa0, __global float* xa1, int xa2, int xa3, int xa4, int xa5,
        __global float* xa6, __global float* xa7, __global float* xa8, __global float* xa9,
        __global int* xa10, __global int* xa11, __global float* xa12);

    solveStepx31(0.125f, x4, x8, x12, x16, 4, x112, x19, x89, x60, x90, x101, x104);

    #pragma omp target device(opencl) ndrange(3, 0, 0, 0, x8, x12, x16, 16, 16, 4) copy_deps
    #pragma omp task in([6] xa4) inout([x18] xa3)
    __kernel void expandBounds(int xa0, int xa1, int xa2, __global float* xa3, __global int* xa4, int xa5);

    expandBounds(x8, x12, x16, x19, x111, 4);

    int x133 = x126 + 1;
    int x134 = x133 % 5000;
    bool x135 = x134 == 0;
    if (x135) {
      int x136 = x133 / 5000;
      #pragma omp target device(smp) copy_deps
      #pragma omp task in([x18] x19, [3] x4)
      writeVTI(x19, x8, x12, x16, string("diffusion"), x4, 4, x136);
    }
  }
  ...
}
#pragma omp taskwait
CDR: Example 2 – Pure convection phenomena

def hotCube(cx: Float, cy: Float, cz: Float, edgeSize: Float)
           (xp: Float, yp: Float, zp: Float) = {
  if (xp >= cx - edgeSize && xp <= cx + edgeSize &&
      yp >= cy - edgeSize && yp <= cy + edgeSize &&
      zp >= cz - edgeSize && zp <= cz + edgeSize) Some(10) else Some(5)
}
val c = Cartesian(25, 50, 75)
val temp = Unknown(c)
val cube = Source(hotCube(12.5, 25, 37.5, 6) _, temp)
val hv = Vector(1, 1, 1)
val pre = PreProcess(nsteps = 500, deltaT = 1, h = hv)(cube)(PeriodicHighZ)
val v = Vector(0, 0, 1)
val convection = dt(temp) + grad(temp) * v
solve(pre)(flush) equation convection to "convection"

Stabilization scheme done internally by CDR
CDR: Example 2 – Pure convection phenomena
The numerical scheme does not introduce artificial diffusion, thanks to the stabilization. The cubic form is preserved.

val v = Vector(0, 0, 1)
val convection = dt(temp) + grad(temp) * v
solve(pre)(flush) equation convection to "convection"

Stabilization scheme done internally by CDR
CDR: Example 3 – Acoustic wave equation in a heterogeneous environment

Incomplete code:

def CDef(x: Rep[Float], y: Rep[Float], z: Rep[Float]) = {
  if (x >= 300 && x <= 400 && y >= 300 && y <= 400) (1700*1700) else (2000*2000)
}
val c = Cartesian(500, 500, 9)
val pressure = Unknown(c)
val waveSource = PointSourceSource(250, 250, 5)(rickerWalet(20) _, pressure)
val hv = Vector(1, 1, 1)
val pre = PreProcess(nsteps = 50000, deltaT = 0.003333, h = hv)(waveSource)
val C = CDef _
val wavePropagation = C * lapla(pressure) - dt2(pressure)
val post = snapshoot each 10 steps
solve(pre)(post) equation wavePropagation to "wave"
CDR: Example 3 – Acoustic wave equation
Conclusions
OmpSs is a task-based programming model
– Supports an asynchronous task execution model
– Supports heterogeneity and distributed memory
– Extends OpenMP
  • Some OmpSs characteristics are now in the standard, e.g. the dependence clauses
  • Continuous feedback to the standardisation body
– OmpSs can improve MPI behavior by enabling the overlap of communication with computation

OmpSs is not just a research project
– A whole team of researchers, developers and PhD students contributing
– Distributed as open source, in a pseudo-professional way (git repository, ticketing, …)
– Open to collaborations!
Conclusions
Enabling developers to stay in their comfort zone when programming for Exascale computing is still a challenge
Efforts like DSLs on top of powerful runtimes such as OmpSs seem to be a good strategy
– Offer a language tailored to solve problems in one domain
– Run efficiently on an HPC system

Future work
– Further develop/optimize the environment
– Combine it with MPI
– Continue optimizing the runtime for further scaling, fault tolerance, …
• Contact: pm-tools@bsc.es
• Source code available from http://pm.bsc.es/ompss/
Contributors

Jesus Labarta, Eduard Ayguade, Rosa M. Badia, Xavier Martorell, Vicenç Bertran, Alex Duran (Intel), Roger Ferrer, Xavier Teruel, Javier Bueno, Judit Planas, Sergi Mateo, Guillermo Miranda, Florentino Sainz, Victor Lopez, Marta Garcia, Josep M. Perez, Omer Subasi, Javier Arias, Harald Servat, Judit Gimenez, Kallia Chronaki, Alejandro Fernández, …
Thank you!