
Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory

Leopold Grinberg, IBM Research, T.J. Watson Research Center

leopoldgrinberg@us.ibm.com


AC922

Hardware: 42 TF and ~6 TB/s memory BW per node.

Challenge:
Memory: system main memory plus each GPU's HBM.
Concurrency: ~1M threads running on the CPUs and GPUs.

Strategy:
Data and memory management; execution policy and expressing parallelism; Unified Addressing; compiler-directive-based programming.

Value:
Designing portable and performance-portable code.

Examples:
• Memory/data management with OpenMP 4.5 directives and Unified Addressing
• Nested data structures, std::vector
• Nested parallelism
• Examples from CORAL-1 benchmarks (LULESH, etc.)
• Asynchronous execution

AC922: POWER9 + V100 + NVLink 2.0

POWER9: 22 cores; 4 hardware threads per core; NVLink 2.0; PCIe Gen4

V100: 80 SMs; up to 2048 threads per SM; up to 32 CUDA blocks per SM

Challenge: Keeping 1M Threads on a Single Node Busy

(22 cores × 4 SMT threads × 2 sockets =) 176 CPU threads + (80 SMs × 2048 threads × 6 GPUs =) 983,040 GPU threads

[Diagram: node memory layout — DDR4 system memory on each socket plus HBM2 on each GPU]

Challenge: Managing Multiple Memories and ~0.5 TB of Data

Programming Languages and Compilers on OpenPOWER

Key features:

CUDA
• Direct access to the GPU instruction set
• When leveraging NVIDIA GPUs, generally achieves the best performance
• Compilers: XL Fortran, NVCC, PGI CUDA Fortran
• Host compilers: GCC, XL, PGI, CLANG

OpenMP 4.x
• High-level directives for heterogeneous CPU + NVIDIA GPU systems
• Platform/accelerator portable
• Fallback execution for safety
• Compilers: IBM XL, LLVM/Clang, GCC

OpenACC
• High-level directives for heterogeneous CPU + NVIDIA GPU systems
• Directive-based parallelization for the accelerator device
• Compilers: PGI, GCC

Why use OpenMP 4.x?

The ultimate goal for developers using OpenMP 4.0 and beyond is to achieve:

a) portability

b) performance portability

while using the same source code and compiling it on different platforms.

OpenMP 4.5 allows incremental transition of applications: non-threaded codes can first be parallelized using OpenMP directives (if the algorithm allows parallelization), tested on the host (CPU), and then offloaded to the device (GPU):

  // 1. Serial
  for (i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];

  // 2. Host-parallel
  #pragma omp parallel for
  for (i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];

  // 3. Offload directives in place but disabled: still runs on the host
  #pragma omp target teams distribute parallel for if(0)
  for (i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];

  // 4. Offloaded to the device, with explicit data mapping
  #pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N]) if(target:1)
  for (i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
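As a practical note, such code is typically compiled with -qsmp=omp -qoffload under IBM XL, or with -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda under LLVM/Clang; the if(0) and if(target:…) clauses then let a single source toggle between host execution and offloading.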


Why use OpenMP 4.x? Porting results from simulations on IBM Minsky nodes (2 POWER8 CPUs and 4 P-100 GPUs):

Code | Comments | Effort | No offloading | With offloading | Scale
LULESH | XLC; BW limited | 2-3 days | FOM: 17,000/node | FOM: 196,000/node (~12x) | 27 nodes
AMG2013 | XLC; read-BW limited; cuSparse | < 1 week | FOM: 0.7e+08/node | FOM: 9.4e+08/node (~12x) | 1 node
HPCG** | CLANG; read-BW limited | 3 weeks | FOM: 15.8 | FOM: 197 (~12x) | 1 node
Quicksilver | "GPU-hostile" code; load-balancing issues; tracking kernel time | 1 mo. code restructuring + 2 days porting | 35 s | 26.7 s (2 CPUs + 4 GPUs) | 1 node
Opacity library* | Table lookups, integer arithmetic | ~3 weeks | Speedup: 1x | Up to 4x with data transfers; up to 30x with data resident on the GPU | 1 P8 vs. 1 P-100

*Joint work with LLNL and IBM; **sequential Gauss-Seidel has been replaced with multi-colored Gauss-Seidel.

Programming with OpenMP 4.5

OpenMP 4.5 covers:
• Managing memory
• Defining execution space
• Managing data
• Nesting parallel regions
• Coarse-grain parallelism
• Fine-grain parallelism
• Intra-node communication
• Inter-node communication

Challenge: managing memory and data with OpenMP 4.5

[Diagram: DDR system memory plus HBM2 on each GPU]

Managing memory:
• Memory allocation/deallocation
• Use of memory pools
• Use of Unified Addressing
• Preventing page migration

Managing data:
• Replication of data on different memories + synchronization
• Placing data in buffers provided by memory pools
• Use of Unified Addressing
• [Random] switching of execution between HOST and DEVICE requires careful synchronization
• Using the [same] data on the HOST and DEVICES concurrently and/or in stages

Managing memory and data

• Memory/data management using OpenMP directives and API
• Memory/data management using Unified Addressing
• Memory/data management mixing OpenMP directives/API and Unified Addressing

Managing memory and data using OpenMP 4.5

Managing memory and data is typically the first task that developers have to tackle when porting applications from a homogeneous to a heterogeneous system. OpenMP 4.5 provides a number of options (option 3 is sketched below):

1. Use of directives (map, enter data, release, delete, …)

2. Use of OpenMP API calls (omp_target_alloc, omp_target_memcpy, …)

3. Mixing 1 and 2: allocating memory on the device with omp_target_alloc or CUDA APIs and on the host with malloc, associating the pointers with omp_target_associate_ptr, and applying directives (map, update, …)
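A minimal sketch of option 3 (illustrative names and a hypothetical helper, not from the slides; error checking omitted): device memory is allocated explicitly, associated with a host allocation, and directives then operate through that association.

  #include <omp.h>
  #include <math.h>
  #include <stdlib.h>

  void option3_sketch()
  {
      const int N = 1024;
      int dev = omp_get_default_device();
      double *x = (double*) malloc(N * sizeof(double));
      for (int i = 0; i < N; ++i) x[i] = 0.1*i;

      /* allocate device memory and associate it with the host pointer */
      double *x_dev = (double*) omp_target_alloc(N * sizeof(double), dev);
      omp_target_associate_ptr(x, x_dev, N * sizeof(double), 0, dev);

      /* directives now reuse the association instead of allocating/copying on their own */
  #pragma omp target update to(x[0:N])
  #pragma omp target teams distribute parallel for
      for (int i = 0; i < N; ++i)
          x[i] = sin(x[i]);
  #pragma omp target update from(x[0:N])

      omp_target_disassociate_ptr(x, dev);
      omp_target_free(x_dev, dev);
      free(x);
  }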


Managing memory and data using OpenMP 4.5* — data replication model

From the OpenMP specification: "The syntax of the map clause is as follows: map([ [map-type-modifier[,]] map-type : ] list), where map-type is one of the following: to, from, tofrom, alloc, release, delete"; the only map-type-modifier is always.

  map-type | behavior
  to       | allocate* device memory; copy* from CPU to GPU
  from     | allocate* device memory; copy* from GPU to CPU; deallocate* device memory
  tofrom   | allocate* device memory; copy* from CPU to GPU; {kernel}; copy from GPU to CPU; deallocate* memory
  alloc    | allocate* device memory on the GPU
  release  | reduce the reference count; possibly delete
  delete   | deallocate* device memory

*IBM's implementation

Managing Memory and Data: Example

  double *x, *y;
  int N = 3238289;
  x = new double[N];
  y = new double[N];
  for (int i = 0; i < N; ++i)
      x[i] = 0.1*i;

  #pragma omp target teams distribute parallel for map(to:x[0:N]) map(from:y[0:N])
  for (int i = 0; i < N; ++i)
      y[i] = sin(x[i])*cos(x[i]);

[Diagram: x[0:N] and y[0:N] replicated in CPU memory and GPU memory]

Managing Memory and Data: Example

  double *x, *y;
  int N = 3238289;
  x = new double[N];
  y = new double[N];
  for (int i = 0; i < N; ++i)
      x[i] = 0.1*i;

  #pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

  #pragma omp target teams distribute parallel for
  for (int i = 0; i < N; ++i)
      y[i] = sin(x[i])*cos(x[i]);

  #pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

[Diagram: x[0:N] and y[0:N] replicated in CPU memory and GPU memory]

The OpenMP runtime will automatically detect whether arrays x and y are already mapped.

Managing Memory and Data: Example

  double *x, *y;
  int N = 3238289;
  x = new double[N];
  y = new double[N];
  for (int i = 0; i < N; ++i) x[i] = 0.1*i;

  #pragma omp target enter data map(to:x[0:N]) map(alloc:y[0:N])

  #pragma omp target teams distribute parallel for
  for (int i = 0; i < N; ++i) y[i] = sin(x[i])*cos(x[i]);

  functionA(x, y, N);

  #pragma omp target exit data map(release:x[0:N]) map(from:y[0:N])

  void functionA(double *x, double *y, int N)
  {
  #pragma omp target teams distribute parallel for map(to:x[0:N]) map(tofrom:y[0:N])
      for (int i = 0; i < N; ++i) y[i] = 2.0*x[i] + y[i];
  }

[Diagram: x[0:N] and y[0:N] replicated in CPU memory and GPU memory]

Use case for reference counters: since x and y are already mapped, the maps inside functionA are effectively no-ops.

Managing memory and data using OpenMP 4.5: interoperability with NVIDIA libraries

Developers working on codes for simulations on CPUs and GPUs can also mix CUDA APIs and OpenMP 4.5 directives/API calls for memory/data management. For example, in our implementation of the AMG2013 benchmark we use OpenMP 4.5 for memory allocation, data initialization, and some kernels, while we use the cuSparse library for GPU-optimized sparse matrix-vector multiplications.

  #pragma omp target enter data map(to: x_data[0:x_size], y_data[0:y_size])
  …
  #pragma omp target data use_device_ptr(x_data, y_data)
  {
      cusparseDcsrmv(cu_spmv_handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     num_rows, num_cols, nnz, &alpha, cu_spmv_descr,
                     d_A_data, d_A_i, d_A_j, x_data, &beta, y_data);
  }

Warning: making the code portable to different types of architectures will require additional work! cuSparse is not available on systems without NVIDIA GPUs.
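One way to keep a single source portable is to guard the cuSparse path behind a build flag and fall back to a plain OpenMP kernel. A sketch under assumed names (HAVE_CUSPARSE, and host arrays A_data/A_i/A_j already mapped with enter data), not the actual AMG2013 code:

  #ifdef HAVE_CUSPARSE
  #pragma omp target data use_device_ptr(x_data, y_data)
  {
      cusparseDcsrmv(cu_spmv_handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     num_rows, num_cols, nnz, &alpha, cu_spmv_descr,
                     d_A_data, d_A_i, d_A_j, x_data, &beta, y_data);
  }
  #else
  /* fallback CSR SpMV: y = alpha*A*x + beta*y */
  #pragma omp target teams distribute parallel for
  for (int row = 0; row < num_rows; ++row) {
      double sum = 0.0;
      for (int k = A_i[row]; k < A_i[row+1]; ++k)
          sum += A_data[k] * x_data[A_j[k]];
      y_data[row] = alpha*sum + beta*y_data[row];
  }
  #endif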

Managing memory/data: deeply nested data structures

Example (from the Quicksilver benchmark) — the structure reachable from qs_vector<MC_Domain> domain:

  Class MC_Domain:
      qs_vec<MC_Cell_state> cell_state
      MC_Mesh_Domain mesh

  Class MC_Cell_state:
      qs_vec<double> _number_dencity
      qs_vec<task_precomputed_multigroup_macroscopic_cross_sections_type> _task

  Class task_precomputed_multigroup_macroscopic_cross_sections_type:
      qs_vector<double> _total

  Class MC_Mesh_Domain:
      qs_vec<int> _nbrDomainGid
      qs_vec<int> _nbrRank
      qs_vec<MC_Vector> _node
      qs_vec<MC_Facet_Adjacency_Cell> _cellConnectivity
      qs_vec<MC_Facet_Geometry_Cell> cell_geometry

  Class MC_Vector { double; double; double }

  Class MC_Facet_Adjacency_Cell:
      qs_vec<MC_Facet_Adjacency> facet
      qs_vec<int> point

  Class MC_Facet_Adjacency:
      Subfacet_Adjacency subfacet

  Class Subfacet_Adjacency:
      MC_Subfacet_Adjacency_Event::Enum event
      MC_Location current
      MC_Location adjacent

  Class MC_Location { int; int; int }

  Class MC_Facet_Geometry_Cell:
      qs_vec<MC_General_Plane> facet

  Class MC_General_Plane { double; double; double; double }

Managing memory/data: deeply nested data structures

Consider a struct a holding a pointer y and a size. [Diagram: a {*y, size} and y[0:N] replicated in CPU memory and GPU memory]

Managing memory/data: deeply nested data structures

Option 1 — map the struct, then map (attach) the pointed-to array:

  #pragma omp target enter data map(to:a[0:1])
  #pragma omp target enter data map(to:a->y[0:N])
  #pragma omp target
  {
      a->y[3] += …;
  }
  #pragma omp target exit data map(release:a->y[0:N])
  #pragma omp target exit data map(release:a[0:1])

Option 2 — map both, then fix the device copy of the pointer by hand:

  #pragma omp target data map(to:y[0:N], a[0:1])
  {
  #pragma omp target
      {
          a->y = y;  // inside the target region, y refers to the device copy
      }
  #pragma omp target
      {
          a->y[3] += …;
      }
  }

[Diagram: a {*y, size} and y[0:N] in CPU memory and GPU memory]

Managing memory/data: Unified Addressing

Example: std::vector


Managing memory/data: deeply nested data structures

  // Minimal allocator that backs std::vector with CUDA managed (unified) memory.
  // (A complete allocator would also provide operator==/operator!=.)
  template <class T>
  struct UMAllocator {
      typedef T value_type;
      UMAllocator() {}
      template <class U> UMAllocator(const UMAllocator<U>&) {}
      T* allocate(std::size_t n)
      {
          T* ptr;
  #ifdef USE_CUDA_MANAGED
          cudaMallocManaged(&ptr, n*sizeof(T));
  #else
          ptr = (T*) malloc(n*sizeof(T));
  #endif
          return ptr;
      }
      void deallocate(T* p, std::size_t n)
      {
  #ifdef USE_CUDA_MANAGED
          cudaFree(p);
  #else
          free(p);
  #endif
      }
  };

  std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
  …
  m_zdd.resize(numNode);

Managing memory/data: deeply nested data structures

On systems with ATS, plain malloc'd memory is accessible from the GPU, so the allocator only needs to prefetch:

  template <class T>
  struct UMAllocator {
      typedef T value_type;
      UMAllocator() {}
      template <class U> UMAllocator(const UMAllocator<U>&) {}
      T* allocate(std::size_t n)
      {
          T* ptr;
          ptr = (T*) malloc(n*sizeof(T));
  #ifdef USE_ATS
          cudaMemPrefetchAsync(ptr, n*sizeof(T), 0, 0);  // required today for performance reasons
          cudaDeviceSynchronize();
  #endif
          return ptr;
      }
      void deallocate(T* p, std::size_t n)
      {
          free(p);
      }
  };

  std::vector<Real_t, UMAllocator<Real_t> > m_dzz;
  …
  m_zdd.resize(numNode);

We expect that in the future prefetching will be handled by the OS, and the CUDA API call will not be required. Note: this will not work on systems that do not support ATS.

Managing memory/data on systems with ATS* (Address Translation Services)

*Parts of this presentation include IBM's extensions to OpenMP 4.5 and features of OpenMP 5.0 already implemented in compilers supporting OpenMP 4.5.
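Background: on the AC922, ATS lets the V100 GPUs translate CPU virtual addresses, so pointers obtained from malloc/new can be dereferenced directly in device code without an explicit map; the examples below rely on this.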

Managing memory/data on systems with ATS

Baseline: explicit map clause.

  int main(){
      int N = 20;
      double *data = new double[N];
      omp_set_default_device(0);
  #pragma omp target teams distribute parallel for map(from:data[0:N])
      for (int i = 0; i < N; ++i)
          data[i] = i*0.1;
      for (int i = 0; i < N; i += 4)
          printf("data[%d] = %g\n", i, data[i]);
      delete [] data;
      return 0;
  }

Managing memory/data on systems with ATS

On systems with ATS enabled, the kernel can use the host pointer directly via is_device_ptr:

  int main(){
      int N = 20;
      double *data = new double[N];
      omp_set_default_device(0);
  #pragma omp target teams distribute parallel for is_device_ptr(data)
      for (int i = 0; i < N; ++i)
          data[i] = i*0.1;
      for (int i = 0; i < N; i += 4)
          printf("data[%d] = %g\n", i, data[i]);
      delete [] data;
      return 0;
  }

Managing memory/data on systems with ATS

On systems with ATS enabled and export XLSMPOPTS=TARGETMEM=UIMPLICIT, no clause is needed at all:

  int main(){
      int N = 20;
      double *data = new double[N];
      omp_set_default_device(0);
  #pragma omp target teams distribute parallel for  // is_device_ptr(data) not needed
      for (int i = 0; i < N; ++i)
          data[i] = i*0.1;
      for (int i = 0; i < N; i += 4)
          printf("data[%d] = %g\n", i, data[i]);
      delete [] data;
      return 0;
  }

Managing memory/data on systems with ATS

On systems with ATS enabled and export XLSMPOPTS=TARGETMEM=UIMPLICIT, an explicit map is still honored:

  int main(){
      int N = 20;
      double *data, *data2;
      omp_set_default_device(0);
      data = new double[2];
      data2 = new double[N];
  #pragma omp target teams distribute parallel for map(from:data2[0:N])  // is_device_ptr(data) not needed
      for (int i = 0; i < N; ++i)
          data2[i] = data[i%2] + i*0.1;
      delete [] data;
      delete [] data2;
      return 0;
  }

Here "map" is not ignored: memory for data2 is allocated on the device, and the content of data2 is copied from device to host, while data is accessed through ATS.

Managing memory/data on systems with ATS: Fortran

  subroutine foo(a,n)
      real*8, dimension(n) :: a
  !$omp target teams distribute parallel do
      do i=1,n ; a(i)=0.1*i ; end do
  end

  program test_implicit_ats
      integer, parameter :: N=20
      real*8, dimension(:), allocatable :: data
      allocate(data(N))
      call foo(data,N)
      print *,data(10)
  end

With the default settings, nvprof shows the implicit map transfers:

  nvprof ./a.out
  1.9520us  160B  78.170MB/s  Pinned  Device  Tesla V100-SXM2  [CUDA memcpy HtoD]
  1.6000us  -                                 Tesla V100-SXM2  __xl_foo_l3_OL_1 [156]
  2.0480us  160B  74.506MB/s  Device  Pinned  Tesla V100-SXM2  [CUDA memcpy DtoH]

With TARGETMEM=UIMPLICIT the copies disappear:

  export XLSMPOPTS=TARGETMEM=UIMPLICIT
  nvprof ./a.out
  7.1040us  -                                 Tesla V100-SXM2  __xl_foo_l3_OL_1 [150]

Contributed by Lixiang Luo, IBM Research

LULESH: performance with ATS and CUDA Managed Memory

Allocator variants compared:

  #ifdef USE_ATS
      ptr = (T*) malloc(n*sizeof(T));
      cudaMemPrefetchAsync(ptr, n*sizeof(T), 0, 0);
      cudaDeviceSynchronize();
  #else …

  #ifdef USE_CUDA_MANAGED
      cudaMallocManaged(&ptr, n*sizeof(T));
      cudaMemPrefetchAsync(ptr, n*sizeof(T), 0, 0);
      cudaDeviceSynchronize();
  #else …

Simulations with 1 MPI rank/GPU:

  # MPI ranks | # nodes | FOM/node: CUDA Managed | FOM/node: ATS
  1000        | 166.7   | 312,782                | 327,581
  1728        | 288     | 308,760                | 328,513

Simulations with 2 MPI ranks/GPU [+MPS]:

  # MPI ranks | # nodes | FOM/node: CUDA Managed | FOM/node: ATS
  1000        | 83.3    | 332,393                | 358,852
  1728        | 144     | 331,248                | 358,538

OpenMP: Nested Parallel regions on CPUs and GPUs


Nested parallelism + concurrent execution on all devices


  void initialize_x_and_y(double *x, double *y, int N, int offset, bool USE_DEVICE)
  {
  #pragma omp target teams distribute parallel for map(from:x[0:N], y[0:N]) if(target:USE_DEVICE)
      for (int i = 0; i < N; ++i) {
          x[i] = (offset + i) * 0.001;
          y[i] = (offset + i) * 0.003;
      }
  }

  int main(){
      double *x, *y;
      double DEVICE_FRACTION = 0;
      int num_devices, i, chunk, j_start, N = 1024*1024*10;
      bool USE_DEVICE;
      x = new double[N]; y = new double[N];
      // enable nested parallelism
      omp_set_nested(1);
      // get the number of devices
      num_devices = omp_get_num_devices();
      if (num_devices > 0) DEVICE_FRACTION = 0.9;
  #pragma omp parallel for num_threads(num_devices+1) private(chunk, j_start, USE_DEVICE)
      for (i = 0; i < (num_devices+1); ++i) {
          if (i < num_devices) {
              // one host thread per device offloads a chunk
              omp_set_default_device(i);
              chunk = (int)(DEVICE_FRACTION * N / num_devices);
              j_start = chunk * i;
              USE_DEVICE = true;
          } else {
              chunk = N;           // default
              j_start = 0;         // default
              USE_DEVICE = false;  // default
              if (num_devices > 0) {
                  // the last host thread takes the remainder on the CPU
                  j_start = (int)(DEVICE_FRACTION * N / num_devices) * num_devices;
                  chunk = N - j_start;
              }
          }
          initialize_x_and_y(x + j_start, y + j_start, chunk, j_start, USE_DEVICE);
      }
      delete [] x; delete [] y;
      return 0;
  }

Nested parallelism: communication in LULESH

  #pragma omp parallel sections private(pmsg, emsg, cmsg, destAddr)
  {
  #pragma omp section
      {
          if (planeMin | planeMax) {
              …
              destAddr = &domain.commDataSend[pmsg * maxPlaneComm];
  #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
              for (Index_t fi = 0; fi < xferFields; ++fi) {
                  for (Index_t i = 0; i < sendCount; ++i) {
                      destAddr[i + sendCount*fi] = ptr_fi[fi][i];
                  }
              }
              MPI_Isend(destAddr, …);
          }
      }
  #pragma omp section
      {
          if (rowMin && planeMin && not_planeOnly) {
              …
              destAddr = &domain.commDataSend[pmsg * maxPlaneComm + emsg * maxEdgeComm];
  #pragma omp target teams distribute parallel for collapse(2) if(target:USE_DEVICE) is_device_ptr(destAddr) thread_limit(64)
              for (Index_t fi = 0; fi < xferFields; ++fi) {
                  for (Index_t i = 0; i < dx; ++i) {
                      destAddr[i + dx*fi] = ptr_fi[fi][i];
                  }
              }
              MPI_Isend(destAddr, …);
          }
      }
  }
  …

Nested parallelism: communication in LULESH

  #pragma omp parallel num_threads(2)
  {
      if (omp_get_thread_num() == 0) {
          /* evaluate time constraint */
          CalcCourantConstraintForElems(domain, domain.regElemSize(r),
                                        domain.regElemlist(r), domain.qqc(),
                                        domain.dtcourant());
      }
      if (omp_get_thread_num() == (omp_get_num_threads() - 1)) {
          /* check hydro constraint */
          CalcHydroConstraintForElems(domain, domain.regElemSize(r),
                                      domain.regElemlist(r), domain.dvovmax(),
                                      domain.dthydro());
      }
  }

Each of the two routines contains:

  #pragma omp target teams distribute parallel for \
      if(target:USE_DEVICE) map(tofrom:pos) map(from:…)
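The point of this pattern: launching the two constraint calculations from separate host threads lets their target regions execute concurrently on the device(s), rather than being serialized one behind the other by a single host thread.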

Asynchronous execution


Asynchronous execution in LULESH

  void CalcEnergyForElems(…)
  {
      …
  #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep, delvc, … q_old) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]);
          …
      }

  #pragma omp target teams distribute parallel for is_device_ptr(e_new, work) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          e_new[i] += Real_t(0.5) * work[i];
          if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.);
          if (e_new[i] < emin) e_new[i] = emin;
      }

      CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc,
                           pmin, p_cut, eosvmax, length, regElemList);

  #pragma omp target teams distribute parallel for is_device_ptr(delvc, …, regElemList) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          const Real_t sixth = Real_t(1.0) / Real_t(6.0);
          …
      }
  }

  void CalcPressureForElems(Real_t* p_new, …)
  {
  #pragma omp target teams … if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          Real_t c1s = Real_t(2.0)/Real_t(3.0);
          bvc[i] = c1s * (compression[i] + Real_t(1.));
          pbvc[i] = c1s;
      }

  #pragma omp target … if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          Index_t elem = regElemList[i];
          …
      }
  }

Asynchronous execution in LULESH: the same kernels with nowait and depend

  void CalcEnergyForElems(…)
  {
      …
  #pragma omp target teams distribute parallel for is_device_ptr(compHalfStep, delvc, … q_old) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          Real_t vhalf = Real_t(1.) / (Real_t(1.) + compHalfStep[i]);
          …
      }

  #pragma omp target teams distribute parallel for is_device_ptr(e_new, work) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          e_new[i] += Real_t(0.5) * work[i];
          if (FABS(e_new[i]) < e_cut) e_new[i] = Real_t(0.);
          if (e_new[i] < emin) e_new[i] = emin;
      }

      CalcPressureForElems(p_new, bvc, pbvc, e_new, compression, vnewc,
                           pmin, p_cut, eosvmax, length, regElemList, dep_flag);

  #pragma omp target teams distribute parallel for is_device_ptr(delvc, …, regElemList) nowait depend(inout:dep_flag) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          const Real_t sixth = Real_t(1.0) / Real_t(6.0);
          …
      }
  }

  void CalcPressureForElems(Real_t* p_new, …, int dep_flag)
  {
  #pragma omp target teams … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          Real_t c1s = Real_t(2.0)/Real_t(3.0);
          bvc[i] = c1s * (compression[i] + Real_t(1.));
          pbvc[i] = c1s;
      }

  #pragma omp target … nowait depend(inout:dep_flag) if(target:USE_DEVICE)
      for (Index_t i = 0; i < length; ++i) {
          Index_t elem = regElemList[i];
          …
      }
  }
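With nowait each target region becomes a deferred task, and depend(inout:dep_flag) chains these tasks so the kernels still execute in order while the host thread is free to run ahead and enqueue further work; a synchronization point (e.g., #pragma omp taskwait) is needed before the results are consumed on the host.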

[Bar chart: LULESH FOM (z/s) on POWER8 + Pascal nodes, synchronous vs. asynchronous execution at 18 and 27 nodes. Synch: 163K, 158K, 158K; Asynch: 203K, 197K, 196K — asynchronous execution yields roughly 25% higher FOM.]

[Implicit] Placing Data in the GPU's Shared Memory

BLK_SZ is known at compile time. VAL is team-private.

Performance: achieved BW is measured as (Nr*Nc*2*8 bytes)/(kernel time). With BLK_SZ = 32 we measure ~900 GB/s while using shared memory and ~40 GB/s without. (A sketch of such a kernel follows.)
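The kernel itself is not included in the slide text; below is a minimal sketch of the pattern under stated assumptions (a blocked transpose; Nr and Nc assumed to be multiples of BLK_SZ; illustrative, not the presenter's exact code). VAL is declared inside the teams region but outside the parallel loops, so each team gets its own copy, and with BLK_SZ known at compile time the compiler can place it in the GPU's shared memory:

  #define BLK_SZ 32  // known at compile time

  // Blocked transpose: out = in^T for an Nr x Nc row-major matrix.
  void transpose(const double *in, double *out, int Nr, int Nc)
  {
  #pragma omp target teams distribute collapse(2) \
          map(to:in[0:Nr*Nc]) map(from:out[0:Nr*Nc])
      for (int rb = 0; rb < Nr; rb += BLK_SZ) {
          for (int cb = 0; cb < Nc; cb += BLK_SZ) {
              double VAL[BLK_SZ][BLK_SZ];  // team-private; eligible for shared memory
  #pragma omp parallel for collapse(2)
              for (int r = 0; r < BLK_SZ; ++r)
                  for (int c = 0; c < BLK_SZ; ++c)
                      VAL[c][r] = in[(rb + r)*Nc + (cb + c)];
              // implicit barrier at the end of the parallel region
  #pragma omp parallel for collapse(2)
              for (int c = 0; c < BLK_SZ; ++c)
                  for (int r = 0; r < BLK_SZ; ++r)
                      out[(cb + c)*Nr + (rb + r)] = VAL[c][r];
          }
      }
  }

Each team works on one BLK_SZ x BLK_SZ tile, staging it through VAL so that both the reads from in and the writes to out are coalesced.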

Acknowledgements:

IBM Compiler and OpenMP runtime team: Ettore Tiotto, Tarique Islam, Bardia Mahjour, Zarko Todorovski, Wael Yehia, Rafik Zurob, Wang Chen, Kelvin Li, Alexandre Eichenberger, George Bercea, Kevin O'Brien

LLNL personnel: Riyaz Haque, Tom Scogland