Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory
Leopold Grinberg, IBM Research, T.J. Watson Center
[email protected]


Page 1:

Join the Conversation #OpenPOWERSummit

Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory

Leopold Grinberg, IBM/Research, T.J. Watson Center

[email protected]

Page 2:

Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory

for (i = 0; i < N; ++i)
    data[i] = …

42 TF and ~6 TB/s memory BW

Page 3:

AC922

Hardware: AC922 node. Memory: system main memory, GPU's HBM. Concurrency: ~1M threads running on the CPUs and GPUs.

Challenge: data and memory management; execution policy and expressing parallelism.

Strategy: Unified Addressing; compiler-directive-based programming.

Value: designing portable and performance-portable code.

Examples:
• Memory/data management with OpenMP 4.5 directives and Unified Addressing
• Nested data structures, std::vector
• Nested parallelism
• Examples from CORAL-1 benchmarks (LULESH, etc.)
• Asynchronous execution

Page 4:

AC922: POWER9 + V100 + NVLink 2.0

V100: 80 SMs; up to 2048 threads per SM; up to 32 CUDA blocks per SM

POWER9: 22 cores; 4 hardware threads per core; NVLink 2.0; PCIe Gen4

Page 5:

Challenge: Keeping 1M Threads on a Single Node Busy

(2 CPUs × 22 cores × 4 hardware threads =) 176 CPU threads + (6 GPUs × 80 SMs × 2048 threads =) 983,040 GPU threads

Page 6:

Challenge: Managing Multiple Memories and ~0.5 TB of data

[Diagram: two POWER9 sockets, each with DDR4 system memory and NVLink-attached V100 GPUs with HBM2]

Page 7:

Programming Languages and Compilers on OpenPOWER

Key Features

CUDA
• Direct access to the GPU instruction set
• When leveraging NVIDIA GPUs, generally achieves best performance
• Compilers: XL Fortran, NVCC, PGI CUDA Fortran
• Host compilers: GCC, XL, PGI, CLANG

OpenMP 4.x
• High-level directives for heterogeneous CPU + NVIDIA GPU systems
• Platform/accelerator portable
• Fallback execution for safety
• Compilers: IBM XL, LLVM/Clang, GCC

OpenACC
• High-level directives for heterogeneous CPU + NVIDIA GPU systems
• Directive-based parallelization for the accelerator device
• Compilers: PGI, GCC

Page 8:

Why use OpenMP 4.x?

The ultimate goal for developers using OpenMP 4.0 and beyond is to achieve:

a) portability

b) performance portability

while using the same source code and compiling it on different platforms.

OpenMP 4.5 allows incremental transition of applications: non-threaded code can first be parallelized using OpenMP directives (if the algorithm allows parallelization), tested on the host (CPU), and then offloaded to the device (GPU):

// Step 1: sequential loop
for (i = 0; i < N; i++)
    y[i] = a*x[i] + y[i];

// Step 2: parallelize on the host
#pragma omp parallel for
for (i = 0; i < N; i++)
    y[i] = a*x[i] + y[i];

// Step 3: add the target construct, but keep execution on the host (if(0))
#pragma omp target teams distribute parallel for if(0)
for (i = 0; i < N; i++)
    y[i] = a*x[i] + y[i];

// Step 4: offload to the device with explicit data mapping
#pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N]) if(target:1)
for (i = 0; i < N; i++)
    y[i] = a*x[i] + y[i];
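For reference, a complete, compilable version of the final step (a minimal sketch; the compile lines below are typical for this platform, but flags vary by installation):

#include <stdio.h>
#include <stdlib.h>

int main(void){
  int i, N = 1 << 20;
  double a = 2.0;
  double *x = (double*) malloc(N*sizeof(double));
  double *y = (double*) malloc(N*sizeof(double));
  for (i = 0; i < N; ++i){ x[i] = 0.1*i; y[i] = 1.0; }

  /* offloaded DAXPY: x is copied to the device, y is copied in and back */
  #pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N])
  for (i = 0; i < N; ++i)
    y[i] = a*x[i] + y[i];

  printf("y[10] = %g\n", y[10]);
  free(x); free(y);
  return 0;
}

Typical compile lines: xlc_r -qsmp=omp -qoffload daxpy.c (IBM XL) or clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda daxpy.c (LLVM/Clang).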

Page 9:

Why use OpenMP 4.x?

| Code | Comments | Effort | No offloading | With offloading | Configuration |
| LULESH | XLC, BW limited | 2-3 days | FOM: 17,000 / node | FOM: 196,000 / node (~12x) | 27 nodes |
| AMG2013 | XLC, read-BW limited, cuSparse | < 1 week | FOM: 0.7e+08 / node | FOM: 9.4e+08 / node (~12x) | 1 node |
| HPCG** | CLANG, read-BW limited | 3 weeks | FOM: 15.8 | FOM: 197 (~12x) | 1 node |
| Quicksilver | "GPU-hostile" code, load-balancing issues; tracking kernel time | 1 mo. code restructuring, 2 days porting | 35 s | 26.7 s (2 CPUs + 4 GPUs) | 1 node |
| Opacity library* | Table lookups, integer arithmetic | ~3 weeks | Speedup: 1x | Up to 4x with data transfers, up to 30x with data in GPU | 1 P8 vs. 1 P-100 |

Simulations on IBM Minsky nodes (2 POWER8 CPUs and 4 P-100 GPUs). *Joint work with LLNL and IBM; **sequential Gauss-Seidel has been replaced with multi-colored Gauss-Seidel.

Page 10:

Programming with OpenMP 4.5

OpenMP 4.5 programming involves:
• Managing memory
• Defining execution space
• Managing data
• Nesting parallel regions
• Coarse-grain parallelism
• Fine-grain parallelism
• Intra-node communication
• Inter-node communication

Page 11:

Challenge: managing memory and data with OpenMP 4.5

[Diagram: DDR system memory and GPU HBM2 memories on one node]

Managing memory:
• Memory allocation/deallocation
• Use of memory pools
• Use of Unified Addressing
• Preventing page migration

Managing data:
• Replication of data on different memories + synchronization
• Placing data in buffers provided by memory pools
• Use of Unified Addressing
• [Random] switching execution between HOST and DEVICE requires careful synchronization
• Using [same] data by HOST and DEVICES concurrently and/or in stages

Page 12:

Managing memory and data

• Memory/data management using OpenMP directives and API
• Memory/data management using Unified Addressing
• Memory/data management while mixing OpenMP directives/API and Unified Addressing

Page 13:

Managing memory and data using OpenMP 4.5

Managing memory and data is typically the first task that developers have to tackle when porting applications from a homogeneous to a heterogeneous system.

OpenMP 4.5 provides a number of options:

1. Use of directives (map, enter data, release, delete, …)

2. Use of OpenMP API calls (omp_target_alloc, omp_target_memcpy, …)

3. Mixing 1 and 2: allocate memory on the device with omp_target_alloc or CUDA APIs and on the host with malloc, associate the pointers using omp_target_associate_ptr, and apply directives (map, update, …), as sketched below.
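A minimal sketch of options 2 and 3, assuming a default device exists; all calls used here are standard OpenMP 4.5 API:

#include <omp.h>
#include <stdlib.h>
#include <string.h>

void sketch(int N){
  int dev  = omp_get_default_device();
  int host = omp_get_initial_device();
  double *h_x = (double*) malloc(N*sizeof(double));
  memset(h_x, 0, N*sizeof(double));

  /* option 2: explicit device allocation and copy through the OpenMP API */
  double *d_x = (double*) omp_target_alloc(N*sizeof(double), dev);
  omp_target_memcpy(d_x, h_x, N*sizeof(double), 0, 0, dev, host);

  /* option 3: associate the host and device pointers; the runtime now
     treats h_x as mapped, so directives such as 'target update' move
     data between the two copies */
  omp_target_associate_ptr(h_x, d_x, N*sizeof(double), 0, dev);
  #pragma omp target update to(h_x[0:N])
  #pragma omp target is_device_ptr(d_x)
  { d_x[0] = 42.0; }
  #pragma omp target update from(h_x[0:N])

  omp_target_disassociate_ptr(h_x, dev);
  omp_target_free(d_x, dev);
  free(h_x);
}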

Page 14:

Managing memory and data using OpenMP 4.5* - data replication model

"…The syntax of the map clause is as follows: map([[map-type-modifier[,]] map-type :] list), where map-type is one of the following: to, from, tofrom, alloc, release, delete; the map-type-modifier is always."

| map-type | Semantics* |
| to | allocate device memory; copy from CPU to GPU; deallocate device memory |
| from | allocate device memory; copy from GPU to CPU; deallocate device memory |
| tofrom | allocate device memory; copy from CPU to GPU; {kernel}; copy from GPU to CPU; deallocate memory |
| alloc | allocate device memory on GPU |
| release | reduce reference count, and possibly delete… |
| delete | deallocate device memory |

*IBM's implementation

Page 15:

Managing Memory and Data: Example

double *x, *y;
int N = 3238289;
x = new double[N]; y = new double[N];
for (i = 0; i < N; ++i)
    x[i] = 0.1*i;

#pragma omp target teams distribute parallel for map(to:x[0:N]) map(from:y[0:N])
for (i = 0; i < N; ++i)
    y[i] = sin(x[i])*cos(x[i]);

[Diagram: x[0:N] and y[0:N] in CPU memory and their replicas in GPU memory]

Page 16:

Managing Memory and Data: Example

double *x, *y;
int N = 3238289;
x = new double[N]; y = new double[N];
for (i = 0; i < N; ++i)
    x[i] = 0.1*i;

#pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

// the OpenMP runtime automatically detects that x and y are already mapped
#pragma omp target teams distribute parallel for
for (i = 0; i < N; ++i)
    y[i] = sin(x[i])*cos(x[i]);

#pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

[Diagram: x[0:N] and y[0:N] in CPU memory and their replicas in GPU memory]

Page 17:

Managing Memory and Data: Example

double *x, *y;
int N = 3238289;
x = new double[N]; y = new double[N];
for (i = 0; i < N; ++i) x[i] = 0.1*i;

#pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

#pragma omp target teams distribute parallel for
for (i = 0; i < N; ++i) y[i] = sin(x[i])*cos(x[i]);

functionA(x, y, N);

#pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

// use case for reference counters: since x and y are already mapped,
// the map clauses in functionA are effectively no-ops
functionA(double *x, double *y, int N){
#pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N])
  for (i = 0; i < N; ++i) y[i] = 2.0*x[i] + y[i];
}

[Diagram: x[0:N] and y[0:N] in CPU memory and their replicas in GPU memory]

Page 18:

Managing memory and data using OpenMP 4.5: interoperability with NVIDIA libraries

Developers working on codes that run on CPUs and GPUs can also mix CUDA APIs and OpenMP 4.5 directives/API calls for memory/data management.

For example, in our implementation of the AMG2013 benchmark we use OpenMP 4.5 for memory allocation, data initialization, and some kernels, while we use the cuSparse library for GPU-optimized sparse matrix-vector multiplication:

#pragma omp target enter data map(to: x_data[0:x_size], y_data[0:y_size])
…
#pragma omp target data use_device_ptr(x_data, y_data)
{
  cusparseDcsrmv(cu_spmv_handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 num_rows, num_cols, nnz, &alpha, cu_spmv_descr,
                 d_A_data, d_A_i, d_A_j, x_data, &beta, y_data);
}

Warning: making code portable to different types of architectures will require additional work! cuSparse is not available on systems without NVIDIA GPUs.
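One common way to keep such code portable is to guard the library call with a build flag and provide a plain OpenMP fallback. A minimal sketch, assuming the CSR arrays A_data/A_i/A_j and the vectors x_data/y_data are already mapped to the device; HAVE_CUSPARSE is a hypothetical build flag, not part of the original code:

#ifdef HAVE_CUSPARSE
  #pragma omp target data use_device_ptr(x_data, y_data)
  {
    cusparseDcsrmv(cu_spmv_handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   num_rows, num_cols, nnz, &alpha, cu_spmv_descr,
                   d_A_data, d_A_i, d_A_j, x_data, &beta, y_data);
  }
#else
  /* portable CSR SpMV: y = alpha*A*x + beta*y, offloaded with OpenMP */
  #pragma omp target teams distribute parallel for
  for (int row = 0; row < num_rows; ++row) {
    double sum = 0.0;
    for (int jj = A_i[row]; jj < A_i[row+1]; ++jj)
      sum += A_data[jj] * x_data[A_j[jj]];
    y_data[row] = alpha * sum + beta * y_data[row];
  }
#endif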

Page 19:

Managing memory/data: deeply nested data structures

Page 20:

Managing memory/data: deeply nested data structures

Quicksilver's nested data structures (deduplicated outline):

qs_vector<MC_Domain> domain
Class MC_Domain: qs_vec<MC_Cell_state> cell_state; MC_Mesh_Domain mesh
Class MC_Cell_state: qs_vec<double> _number_dencity; qs_vec<task_precomputed_multigroup_macroscopic_cross_sections_type> _task
Class task_precomputed_multigroup_macroscopic_cross_sections_type: qs_vector<double> _total
Class MC_Mesh_Domain: qs_vec<int> _nbrDomainGid; qs_vec<int> _nbrRank; qs_vec<MC_Vector> _node; qs_vec<MC_Facet_Adjacency_Cell> _cellConnectivity; qs_vec<MC_Facet_Geometry_Cell> cell_geometry
Class MC_Vector { double; double; double }
Class MC_Facet_Adjacency_Cell: qs_vec<MC_Facet_Adjacency> facet; qs_vec<int> point
Class MC_Facet_Adjacency: Subfacet_Adjacency subfacet
Class Subfacet_Adjacency: MC_Subfacet_Adjacency_Event::Enum event; MC_Location current; MC_Location adjacent
Class MC_Location { int; int; int }
Class MC_Facet_Geometry_Cell: qs_vec<MC_General_Plane> facet
Class MC_General_Plane { double; double; double; double }

Page 21:

Managing memory/data: deeply nested data structures

[Diagram: struct a {*y, size} and array y[0:N] in CPU memory and their replicas in GPU memory]

Page 22:

Managing memory/data: deeply nested data structures

// Approach 1: map the struct, then attach the pointee with a separate map
#pragma omp target enter data map(to:a[0:1])
#pragma omp target enter data map(to:a->y[0:N])
#pragma omp target
{
  a->y[3] += …
}
#pragma omp target exit data map(release:a->y[0:N])
#pragma omp target exit data map(release:a[0:1])

// Approach 2: map the data, then fix the device pointer inside a target region
#pragma omp target data map(to:y[0:n], a[0:1])
{
  #pragma omp target
  {
    a->y = y;
  }
  #pragma omp target
  {
    a->y[3] += …
  }
}

[Diagram: struct a {*y, size} and array y[0:N] in CPU memory and their replicas in GPU memory]
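For reference, a minimal self-contained version of the first approach; the struct name A and its fields are placeholders taken from the slide, the loop body stands in for the elided kernel, and the exit map uses "from" so results are copied back (pointer-attachment behavior follows the slide's approach and IBM's implementation):

#include <cstdio>

struct A { double *y; int size; };

int main() {
  const int N = 100;
  A *a = new A;
  a->y = new double[N];
  a->size = N;
  for (int i = 0; i < N; ++i) a->y[i] = i;

  // map the struct first, then the array it points to; the runtime
  // attaches the device copy of a->y to the device copy of y
  #pragma omp target enter data map(to:a[0:1])
  #pragma omp target enter data map(to:a->y[0:N])

  #pragma omp target teams distribute parallel for
  for (int i = 0; i < N; ++i)
    a->y[i] += 1.0;

  // copy the array back before releasing the mappings
  #pragma omp target exit data map(from:a->y[0:N])
  #pragma omp target exit data map(release:a[0:1])

  printf("a->y[3] = %g\n", a->y[3]);
  delete [] a->y; delete a;
  return 0;
}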

Page 23:

Managing memory/data: Unified Addressing

Example: std::vector

Page 24:

Managing memory/data: deeply nested data structures

template <class T>
struct UMAllocator {
  typedef T value_type;

  UMAllocator() {}
  template <class U> UMAllocator(const UMAllocator<U>& other) {}

  T* allocate(std::size_t n)
  {
    T* ptr;
#ifdef USE_CUDA_MANAGED
    // Unified Memory: the same pointer is valid on the host and the device
    cudaMallocManaged(&ptr, n*sizeof(T));
#else
    ptr = (T*) malloc(n*sizeof(T));
#endif
    return ptr;
  }

  void deallocate(T* p, std::size_t n)
  {
#ifdef USE_CUDA_MANAGED
    cudaFree(p);
#else
    free(p);
#endif
  }
};

std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
…
m_zdd.resize(numNode);

Page 25:

Managing memory/data: deeply nested data structures

template <class T>
struct UMAllocator {
  typedef T value_type;

  UMAllocator() {}
  template <class U> UMAllocator(const UMAllocator<U>& other) {}

  T* allocate(std::size_t n)
  {
    T* ptr;
    ptr = (T*) malloc(n*sizeof(T));
#ifdef USE_ATS
    // required today for performance reasons: prefetch the pages to the GPU
    cudaMemPrefetchAsync(ptr, n*sizeof(T), 0, 0);
    cudaDeviceSynchronize();
#endif
    return ptr;
  }

  void deallocate(T* p, std::size_t n)
  {
    free(p);
  }
};

std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
…
m_zdd.resize(numNode);

We expect that in the future prefetching will be handled by the OS, and the CUDA API will not be required.

Warning: this will not work on systems that do not support ATS.

Page 26:

Managing memory/data on systems with ATS* (Address Translation Service)

*Parts of this presentation include IBM's extensions to OpenMP 4.5 and features of OpenMP 5.0 already implemented in compilers supporting OpenMP 4.5.

Page 27:

Managing memory/data on systems with ATS

int main(){
  int N = 20;
  double *data = new double[N];
  omp_set_default_device(0);

  #pragma omp target teams distribute parallel for map(from:data[0:N])
  for (int i = 0; i < N; ++i)
    data[i] = i*0.1;

  for (int i = 0; i < N; i += 4)
    printf("data[%d] = %g\n", i, data[i]);

  delete [] data;
  return 0;
}

Page 28:

Managing memory/data on systems with ATS

Systems with ATS enabled:

int main(){
  int N = 20;
  double *data = new double[N];
  omp_set_default_device(0);

  // ATS lets the GPU dereference the host pointer directly; no map is needed
  #pragma omp target teams distribute parallel for is_device_ptr(data)
  for (int i = 0; i < N; ++i)
    data[i] = i*0.1;

  for (int i = 0; i < N; i += 4)
    printf("data[%d] = %g\n", i, data[i]);

  delete [] data;
  return 0;
}

Page 29:

Managing memory/data on systems with ATS

Systems with ATS enabled and

export XLSMPOPTS=TARGETMEM=UIMPLICIT

int main(){
  int N = 20;
  double *data = new double[N];
  omp_set_default_device(0);

  // with TARGETMEM=UIMPLICIT, neither is_device_ptr nor map is needed
  #pragma omp target teams distribute parallel for //is_device_ptr(data)
  for (int i = 0; i < N; ++i)
    data[i] = i*0.1;

  for (int i = 0; i < N; i += 4)
    printf("data[%d] = %g\n", i, data[i]);

  delete [] data;
  return 0;
}

Page 30:

Managing memory/data on systems with ATS

Systems with ATS enabled and

export XLSMPOPTS=TARGETMEM=UIMPLICIT

int main(){
  int N = 20;
  double *data, *data2;
  omp_set_default_device(0);
  data  = new double[2];
  data2 = new double[N];

  // here "map" is not ignored: memory for data2 is allocated on the device,
  // and the content of data2 is copied from device to host; data is accessed
  // through ATS
  #pragma omp target teams distribute parallel for map(from:data2[0:N]) //is_device_ptr(data)
  for (int i = 0; i < N; ++i)
    data2[i] = data[i%2] + i*0.1;

  delete [] data;
  delete [] data2;
  return 0;
}

Page 31:

Managing memory/data on systems with ATS (Fortran)

subroutine foo(a,n)
  real*8, dimension(n) :: a
  !$omp target teams distribute parallel do
  do i=1,N ; a(i)=0.1*i; end do
end

program test_implicit_ats
  integer, parameter :: N=20
  real*8, dimension(:), allocatable :: data
  allocate(data(N))
  call foo(data,N)
  print *,data(10)
end

nvprof ./a.out
  1.9520us  160B  78.170MB/s  Pinned -> Device  Tesla V100-SXM2  [CUDA memcpy HtoD]
  1.6000us  -                 Tesla V100-SXM2   __xl_foo_l3_OL_1 [156]
  2.0480us  160B  74.506MB/s  Device -> Pinned  Tesla V100-SXM2  [CUDA memcpy DtoH]

export XLSMPOPTS=TARGETMEM=UIMPLICIT
nvprof ./a.out
  7.1040us  -  -  -  Tesla V100-SXM2  __xl_foo_l3_OL_1 [150]

With TARGETMEM=UIMPLICIT the host-to-device and device-to-host copies disappear; the kernel accesses host memory directly through ATS.

Contributed by Lixiang Luo, IBM Research

Page 32:

LULESH: performance with ATS and CUDA Managed Memory

#ifdef USE_ATS
  ptr = (T*) malloc(n*sizeof(T));
  cudaMemPrefetchAsync(ptr, n*sizeof(T), 0, 0);
  cudaDeviceSynchronize();
#else …

#ifdef USE_CUDA_MANAGED
  cudaMallocManaged(&ptr, n*sizeof(T));
  cudaMemPrefetchAsync(ptr, n*sizeof(T), 0, 0);
  cudaDeviceSynchronize();
#else …

Simulations with 1 MPI rank/GPU:

| # MPI ranks | # nodes | FOM/node: CUDA Managed | FOM/node: ATS |
| 1000 | 166.7 | 312,782 | 327,581 |
| 1728 | 288 | 308,760 | 328,513 |

Simulations with 2 MPI ranks/GPU [+MPS]:

| # MPI ranks | # nodes | FOM/node: CUDA Managed | FOM/node: ATS |
| 1000 | 83.3 | 332,393 | 358,852 |
| 1728 | 144 | 331,248 | 358,538 |

Page 33:

OpenMP: Nested Parallel regions on CPUs and GPUs

Page 34:

Nested parallelism + concurrent execution on all devices

Page 35:

Nested parallelism + concurrent execution on all devices

#include <omp.h>

void initialize_x_and_y(double *x, double *y, int N, int offset, bool USE_DEVICE){
  #pragma omp target teams distribute parallel for map(from:x[0:N], y[0:N]) if(target:USE_DEVICE)
  for (int i = 0; i < N; ++i){
    x[i] = (offset + i) * 0.001;
    y[i] = (offset + i) * 0.003;
  }
}

int main(){
  double *x, *y;
  double DEVICE_FRACTION = 0;
  int num_devices, i, chunk, j_start, N = 1024*1024*10;
  bool USE_DEVICE;
  x = new double[N]; y = new double[N];

  // enable nested parallelism
  omp_set_nested(1);

  // get number of devices
  num_devices = omp_get_num_devices();
  if (num_devices > 0) DEVICE_FRACTION = 0.9;

  // one host thread per device, plus one thread for the CPU's share
  #pragma omp parallel for num_threads(num_devices+1) private(chunk, j_start, USE_DEVICE)
  for (i = 0; i < (num_devices+1); ++i){
    if (i < num_devices){
      omp_set_default_device(i);
      chunk = DEVICE_FRACTION * N / num_devices;
      j_start = chunk * i;
      USE_DEVICE = true;
    } else {
      chunk = N;            // default
      j_start = 0;          // default
      USE_DEVICE = false;   // default
      if (num_devices > 0){
        j_start = (int)(DEVICE_FRACTION * N / num_devices) * num_devices;
        chunk = N - j_start;
      }
    }
    initialize_x_and_y(x + j_start, y + j_start, chunk, j_start, USE_DEVICE);
  }

  delete [] x; delete [] y;
  return 0;
}

Page 36:

Nested parallelism: communication in LULESH

#pragma omp parallel sections private(pmsg, emsg, cmsg, destAddr)
{
  #pragma omp section
  {
    if (planeMin | planeMax) {
      …
      destAddr = &domain.commDataSend[pmsg * maxPlaneComm];
      #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
      for (Index_t fi = 0; fi < xferFields; ++fi) {
        for (Index_t i = 0; i < sendCount; ++i) {
          destAddr[i + sendCount*fi] = ptr_fi[fi][i];
        }
      }
      MPI_Isend(destAddr, …);
    }
  }

  #pragma omp section
  {
    if (rowMin && planeMin && not_planeOnly) {
      …
      destAddr = &domain.commDataSend[pmsg * maxPlaneComm + emsg * maxEdgeComm];
      #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
      for (Index_t fi = 0; fi < xferFields; ++fi) {
        for (Index_t i = 0; i < dx; ++i) {
          destAddr[i + dx*fi] = ptr_fi[fi][i];
        }
      }
      MPI_Isend(destAddr, …);
    }
  }
}
…

Page 37:

Nested parallelism: communication in LULESH

#pragma omp parallel num_threads(2)
{
  if (omp_get_thread_num() == 0){
    /* evaluate time constraint; contains:
       #pragma omp target teams distribute parallel for \
               if(target:USE_DEVICE) map(tofrom:pos) map(from:…) */
    CalcCourantConstraintForElems(domain, domain.regElemSize(r),
                                  domain.regElemlist(r), domain.qqc(),
                                  domain.dtcourant());
  }
  if (omp_get_thread_num() == (omp_get_num_threads() - 1)){
    /* check hydro constraint; contains:
       #pragma omp target teams distribute parallel for \
               if(target:USE_DEVICE) map(tofrom:pos) map(from:…) */
    CalcHydroConstraintForElems(domain, domain.regElemSize(r),
                                domain.regElemlist(r), domain.dvovmax(),
                                domain.dthydro());
  }
}

Page 38:

Asynchronous execution

Page 39:

Asynchronous execution in LULESH: baseline (synchronous) version

void CalcEnergyForElems( ….. ){ …

  #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep, delvc, … q_old) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i) {
    Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]);
    ……..
  }

  #pragma omp target teams distribute parallel for is_device_ptr(e_new, work) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i) {
    e_new[i] += Real_t(0.5) * work[i];
    if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.);
    if (e_new[i] < emin) e_new[i] = emin;
  }

  CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc,
                       pmin, p_cut, eosvmax, length, regElemList);

  #pragma omp target teams distribute parallel for is_device_ptr(delvc, …, regElemList) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i){
    const Real_t sixth = Real_t(1.0) / Real_t(6.0);
    ….
  }

void CalcPressureForElems(Real_t* p_new, …. )

  #pragma omp target teams … if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i) {
    Real_t c1s = Real_t(2.0)/Real_t(3.0);
    bvc[i] = c1s * (compression[i] + Real_t(1.));
    pbvc[i] = c1s;
  }

  #pragma omp target … if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i){
    Index_t elem = regElemList[i];
    …
  }

Page 40:

Asynchronous execution in LULESH: asynchronous version (nowait + depend)

void CalcEnergyForElems( ….. ){ …

  #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep, delvc, … q_old) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i) {
    Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]);
    ……..
  }

  #pragma omp target teams distribute parallel for is_device_ptr(e_new, work) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i) {
    e_new[i] += Real_t(0.5) * work[i];
    if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.);
    if (e_new[i] < emin) e_new[i] = emin;
  }

  CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc,
                       pmin, p_cut, eosvmax, length, regElemList, dep_flag);

  #pragma omp target teams distribute parallel for is_device_ptr(delvc, …, regElemList) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i){
    const Real_t sixth = Real_t(1.0) / Real_t(6.0);
    ….
  }

void CalcPressureForElems(Real_t* p_new, …. int dep_flag)

  #pragma omp target teams … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i) {
    Real_t c1s = Real_t(2.0)/Real_t(3.0);
    bvc[i] = c1s * (compression[i] + Real_t(1.));
    pbvc[i] = c1s;
  }

  #pragma omp target … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
  for (Index_t i = 0; i < length; ++i){
    Index_t elem = regElemList[i];
    …
  }
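The pattern above turns each target region into a deferred target task; chaining them through depend(inout:dep_flag) orders the kernels with respect to each other while freeing the host thread. A minimal self-contained sketch of the same pattern (variable names are illustrative, not from LULESH):

#include <stdio.h>

int main(void){
  int N = 1000, dep_flag = 0;
  double a[1000], b[1000];

  /* two deferred target tasks; depend(inout:dep_flag) orders them */
  #pragma omp target teams distribute parallel for map(from:a[0:N]) \
          nowait depend(inout:dep_flag)
  for (int i = 0; i < N; ++i) a[i] = 0.1 * i;

  #pragma omp target teams distribute parallel for map(to:a[0:N]) map(from:b[0:N]) \
          nowait depend(inout:dep_flag)
  for (int i = 0; i < N; ++i) b[i] = 2.0 * a[i];

  /* the host continues here; wait for both target tasks before using b */
  #pragma omp taskwait
  printf("b[10] = %g\n", b[10]);
  return 0;
}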

[Bar chart: LULESH FOM (z/s) on POWER8 + Pascal at 18 and 27 nodes; synchronous: ~158K-163K per node; asynchronous: ~196K-203K per node]

Page 41:

[implicit] Placing Data in GPU's Shared Memory

BLK_SZ is known at compile time. VAL is team-private.

Performance: achieved BW is measured as (Nr*Nc*2*8 bytes) / (kernel time). With BLK_SZ=32 we measure ~900 GB/s while using shared memory and ~40 GB/s without it.
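The original kernel is not reproduced on this page; the sketch below illustrates the technique under stated assumptions: a blocked matrix transpose in which a team-private buffer VAL with compile-time size BLK_SZ is declared inside the teams region, which IBM's compiler can place in the GPU's shared memory. The names BLK_SZ, VAL, Nr, Nc follow the slide; the kernel body itself is a reconstruction, not the author's code, and assumes Nr and Nc are multiples of BLK_SZ:

#define BLK_SZ 32  /* known at compile time */

/* blocked transpose of the Nr x Nc matrix 'in' into 'out' */
void transpose(const double *in, double *out, int Nr, int Nc){
  #pragma omp target teams distribute collapse(2) \
          map(to:in[0:Nr*Nc]) map(from:out[0:Nr*Nc])
  for (int bi = 0; bi < Nr; bi += BLK_SZ) {
    for (int bj = 0; bj < Nc; bj += BLK_SZ) {
      double VAL[BLK_SZ][BLK_SZ];  /* team-private: may be placed in shared memory */

      /* stage the tile in the team-private buffer (coalesced reads) */
      #pragma omp parallel for collapse(2)
      for (int i = 0; i < BLK_SZ; ++i)
        for (int j = 0; j < BLK_SZ; ++j)
          VAL[i][j] = in[(bi + i)*Nc + (bj + j)];

      /* implicit barrier between the two parallel regions; then write
         the tile out transposed (coalesced writes) */
      #pragma omp parallel for collapse(2)
      for (int i = 0; i < BLK_SZ; ++i)
        for (int j = 0; j < BLK_SZ; ++j)
          out[(bj + i)*Nr + (bi + j)] = VAL[j][i];
    }
  }
}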

Page 42:

Acknowledgements:

IBM Compiler and OpenMP-runtime team:
Ettore Tiotto, Tarique Islam, Bardia Mahjour, Zarko Todorovski, Wael Yehia, Rafik Zurob, Wang Chen, Kelvin Li, Alexandre Eichenberger, George Bercea, Kevin O'Brien

LLNL personnel: Riyaz Haque, Tom Scogland