
Introduction to GPU Computing with OpenACC

Michael Wolfe [email protected]

http://www.pgroup.com

November 2012

1-1

Updates to these Tutorial Notes

The latest updates to these tutorial notes are available at:

http://www.pgroup.com/lit/presentations/sc12_tutorial_openacc_intro.pdf

Also available:

Training at Pittsburgh Supercomputing Center

January 15-16

www.psc.edu/index.php/training/openacc-gpu-programming

1-2

Outline

Introduction

GPU architecture vs. Host architecture

GPU parallel programming vs. Host parallel programming

Introduction to OpenACC Directives

Elements of Directives-based Model

Controlling Data Movement

Explicit Parallelism

Reductions, the Loop directive

Interacting with CUDA

Problems, Pitfalls, Wrapup

Appropriate algorithms and data structures

Obvious Performance Bottlenecks

1-3

AMD “Magny-Cours”

Host Architecture Features

Register files (integer, float)

Functional units (integer, float, address), Icache, Dcache

Execution pipeline (fetch, decode, issue, execute, cache, commit)

branch prediction, hazards (control, data, structural)

pipelined functional units, superpipelining, register bypass

stall, scoreboarding, reservation stations, register renaming

Multiscalar execution (superscalar, control unit lookahead)

LIW (long instruction word)

Multithreading, Simultaneous multithreading

Vector instruction set

Multiprocessor, Multicore, coherent caches (MESI protocols)

1-5

Making It Faster

Processor:

Faster clocks

More work per clock:

superscalar

VLIW

more cores

vector / SIMD instructions

Memory

Latency reduction

Latency tolerance

1-6

Abstracted x64+Tesla Architecture

1-7

Abstracted x64+Fermi Architecture

1-8

1-9

Abstracted x64+Accelerator Architecture

1-10

GPU Architecture Features

Optimized for high degree of regular parallelism

Classically optimized for low precision

High bandwidth memory

Highly multithreaded (slack parallelism)

Hardware thread scheduling

Non-coherent software-managed data caches

some hardware data caches as well

No multiprocessor memory model guarantees

some guarantees with fence operations

1-11

Tesla-10 Features Summary

Massively parallel thread processors

Organized into multiprocessors

up to 30, see deviceQuery or pgaccelinfo

Physically: 8 thread processors per multiprocessor

Logically: 32 threads per warp

Memory hierarchy

host memory, device memory, constant memory, shared memory,

register

Queue of operations (kernels) on device

1-12

Fermi (Tesla-20) Features Summary

Massively parallel thread processors

Organized into multiprocessors

up to 16, see deviceQuery or pgaccelinfo

Physically: two groups of 16 thread processors per multiprocessor

Logically: still 32 threads per warp

Memory hierarchy

host memory, device memory (two level hardware cache), constant

memory, (configurable) shared memory, register

Queue of operations (kernels) on device

ECC memory protection (supported, not default)

Much improved double precision performance

Hardware 32-bit integer multiply

1-13

Kepler (K20) Features Summary

Massively parallel thread processors

Organized into multiprocessors

8 or more, see deviceQuery or pgaccelinfo

Physically: 12 groups of 16 SP thread processors + 4 groups of

DP thread processors per SM

Logically: still 32 threads per warp

Memory hierarchy

host memory, device memory (hardware cache), constant memory,

(configurable) shared memory, register

Queue of operations (kernels) on device

HyperQ – more asynchronous compute operations

ECC memory protection

less memory per "CUDA core"

1-14

Parallel Programming on CPUs

Instruction level parallelism (ILP)

Loop unrolling, instruction scheduling

Vector parallelism

Vectorized loops (or vector intrinsics)

Thread level / Multiprocessor / multicore parallelism

Parallel loops, parallel tasks

POSIX threads, OpenMP, Cilk, TBB, ...

Large scale cluster / multicomputer parallelism

MPI (& HPF, co-array Fortran, UPC, Titanium, X10, Fortress,

Chapel)

1-15

pthreads main routine

call pthread_create( t1, NULL, jacobi, 1 )

call pthread_create( t2, NULL, jacobi, 2 )

call pthread_create( t3, NULL, jacobi, 3 )

call jacobi( 4 )

call pthread_join( t1, NULL )

call pthread_join( t2, NULL )

call pthread_join( t3, NULL )

1-16
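For reference, the same launch pattern written in plain C with the actual pthread_create/pthread_join signatures might look like the sketch below; passing the thread number through the void* argument is an illustrative choice.

#include <pthread.h>

/* the jacobi worker is assumed to be defined elsewhere and to take its
   thread number through the void* argument */
extern void *jacobi( void *arg );

int main( void ){
    pthread_t t1, t2, t3;
    /* launch three helper threads, each running jacobi with its own number */
    pthread_create( &t1, NULL, jacobi, (void*)1L );
    pthread_create( &t2, NULL, jacobi, (void*)2L );
    pthread_create( &t3, NULL, jacobi, (void*)3L );
    /* the main thread does its own share of the work as thread 4 */
    jacobi( (void*)4L );
    /* wait for the helpers to finish */
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );
    pthread_join( t3, NULL );
    return 0;
}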

pthreads subroutine

subroutine jacobi( threadnum )

lchange = 0

do j = threadnum+1, n-1, numthreads

do i = 2, m-1

newa(i,j) = w0*a(i,j) + &

w1 * (a(i-1,j) + a(i,j-1) + &

a(i+1,j) + a(i,j+1)) + &

w2 * (a(i-1,j-1) + a(i-1,j+1) + &

a(i+1,j-1) + a(i+1,j+1))

lchange = max(lchange,abs(newa(i,j)-a(i,j)))

enddo

enddo

call pthread_mutex_lock( lck )

change = max(change,lchange)

call pthread_mutex_unlock( lck )

end subroutine

1-17

Jacobi Relaxation with OpenMP directives

change = 0

!$omp parallel private(i,j)

!$omp do reduction(max:change)

do j = 2, n-1

do i = 2, m-1

newa(i,j) = w0*a(i,j) + &

w1 * (a(i-1,j) + a(i,j-1) + &

a(i+1,j) + a(i,j+1)) + &

w2 * (a(i-1,j-1) + a(i-1,j+1) + &

a(i+1,j-1) + a(i+1,j+1))

change = max(change,abs(newa(i,j)-a(i,j)))

enddo

enddo

!$omp end parallel

1-18

Behind the Scenes

Compiler generates code for N threads:

split up the iterations across N threads

accumulate N partial sums (no synchronization)

accumulate final sum as threads complete

Assumptions

uniform memory access costs

coherent cache mechanism

1-19
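A hand-written C/OpenMP sketch of that strategy (the name local_change is made up) makes the per-thread partial results and the final combining step explicit; reduction(max:change) generates essentially this for you:

#include <math.h>

/* one relaxation sweep; each thread keeps a private partial maximum and
   the partial results are combined as the threads finish */
float relax_step( float **a, float **newa, int n, int m,
                  float w0, float w1, float w2 ){
    float change = 0.0f;
    #pragma omp parallel
    {
        float local_change = 0.0f;          /* per-thread partial result */
        #pragma omp for
        for( int j = 1; j < m-1; ++j )
            for( int i = 1; i < n-1; ++i ){
                newa[j][i] = w0*a[j][i] +
                    w1*(a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
                    w2*(a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
                local_change = fmaxf(local_change, fabsf(newa[j][i]-a[j][i]));
            }
        #pragma omp critical                /* combine the partial maxima */
        change = fmaxf(change, local_change);
    }
    return change;
}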

More Behind the Scenes

Virtualization penalties

load balancing

cache locality

vectorization within the threads

thread management

loop scheduling (which thread does what iteration)

NUMA memory access penalty

1-20

Parallel Programming on GPUs

High degree of regular parallelism

lots of scalar threads

threads organized into thread groups / blocks

SIMD, pseudo-SIMD

thread groups organized into grid

MIMD

Languages

CUDA, OpenCL, graphics: OpenGL, DirectX

may include vector datatypes (float4, int2)

CAPS HMPP

Platforms

formerly ArBB (fka Rapidmind, now owned by Intel)

Thrust library

1-21

GPU Programming

Allocate data on the GPU

Move data from host, or initialize data on GPU

Launch kernel(s)

GPU driver can generate ISA code at runtime

preserves forward compatibility without requiring

ISA compatibility

Gather results from GPU

Deallocate data

1-22

Appropriate GPU programs

Characterized by nested parallel loops

High compute intensity

Regular data access

Isolated host/GPU data movement

1-23

Behind the Scenes

What you write is what you get

Implicitly parallel

threads into warps

warps into thread groups

thread groups into a grid

Hardware thread scheduler

Highly multithreaded

1-24

CUDA C and CUDA Fortran

Simple introductory program

Programming model

Low-level Programming with CUDA

Building CUDA programs

2-25

VADD on Host

subroutine host_vadd(A,B,C,N)

real(4) :: A(N), B(N), C(N)

integer :: N, i

do i = 1,N

C(i) = A(i) + B(i)

enddo

end subroutine

2-26

void host_vadd( float* A, float* B, float* C, int n ){

int i;

for( i = 0; i < n; ++i ){

C[i] = A[i] + B[i];

}

}

CUDA C VADD Device Code

__global__ void vaddkernel( float* A, float* B,

float* C, int n ){

int i;

i = blockIdx.x*blockDim.x + threadIdx.x;

if( i < n ) C[i] = A[i] + B[i];

}

2-27

CUDA Fortran VADD Device Code

module kmod

use cudafor

contains

attributes(global) subroutine vaddkernel(A,B,C,N)

real(4), device :: A(N), B(N), C(N)

integer, value :: N

integer :: i

i = (blockidx%x-1)*blockdim%x + threadidx%x

if( i <= N ) C(i) = A(i) + B(i)

end subroutine

end module

2-28

CUDA C VADD Host Code

void vadd( float* A, float* B, float* C, int n ){

float *Ad, *Bd, *Cd;

cudaMalloc( (void**)&Ad, n*sizeof(float) );

cudaMalloc( (void**)&Bd, n*sizeof(float) );

cudaMalloc( (void**)&Cd, n*sizeof(float) );

cudaMemcpy( Ad, A, n*sizeof(float),

cudaMemcpyHostToDevice );

cudaMemcpy( Bd, B, n*sizeof(float),

cudaMemcpyHostToDevice );

vaddkernel<<< (n+31)/32, 32 >>>( Ad, Bd, Cd, n );

cudaMemcpy( C, Cd, n*sizeof(float),

cudaMemcpyDeviceToHost );

cudaFree( Ad ); cudaFree( Bd ); cudaFree( Cd );

}

2-29

CUDA Fortran VADD Host Code

subroutine vadd( A, B, C )

use kmod

real(4), dimension(:) :: A, B, C

real(4), device, allocatable, dimension(:):: &

Ad, Bd, Cd

integer :: N

N = size( A, 1 )

allocate( Ad(N), Bd(N), Cd(N) )

Ad = A(1:N)

Bd = B(1:N)

call vaddkernel<<<(N+31)/32,32>>>( Ad, Bd, Cd, N )

C(1:N) = Cd

deallocate( Ad, Bd, Cd )

end subroutine

2-30

CUDA Programming

Host code

Optional: select a GPU

Allocate device memory

Copy data to device memory

Launch kernel(s)

Copy data from device memory

Deallocate device memory

Device code

Scalar thread code, limited operations

Implicitly parallel

threads within a thread block, essentially SIMD

thread blocks within a grid, MIMD parallelism

2-31

CUDA Programming: the GPU

A scalar program, runs on one thread

All threads run the same code

Executed in thread groups

grid may be 1D or 2D (max 65535x65535)

3D in CUDA 4.0 (Fermi)

thread block may be 1D, 2D, or 3D

max total size 512 (Tesla) or 1024 (Fermi)

blockidx gives block index in grid (x,y)

threadidx gives thread index within block (x,y,z)

Kernel runs implicitly in parallel

thread blocks scheduled by hardware on any multiprocessor

runs to completion before next kernel

2-32

Directive-Based GPU Programming

Simple introductory program

Programming model

Building Accelerator programs

High-level Programming

3-33

Jacobi Relaxation

change = tolerance + 1.0

do while(change > tolerance)

change = 0

do i = 2, m-1

do j = 2, n-1

newa(i,j) = w0*a(i,j) + &

w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &

w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))

change = max(change,abs(newa(i,j)-a(i,j)))

enddo

enddo

do i = 2, m-1

do j = 2, n-1

a(i,j) = newa(i,j)

enddo

enddo

enddo

3-34

Jacobi Relaxation

do{

change = 0;

for( j = 1; j < m-1; ++j ){

for( i = 1; i < n-1; ++i ){

newa[j][i] = w0*a[j][i] +

w1 * (a[j][i-1] + a[j-1][i] +

a[j][i+1] + a[j+1][i]) +

w2 * (a[j-1][i-1] + a[j+1][i-1] +

a[j-1][i+1] + a[j+1][i+1]);

change = fmax(change,fabs(newa[j][i]-a[j][i]));

}

}

tmp = a; a = newa; newa = tmp;

}while( change > tolerance );

3-35

Jacobi Relaxation

change = tolerance + 1.0

!$omp parallel shared(change)

do while(change > tolerance)

change = 0

!$omp do reduction(max:change) private(i,j)

do i = 2, m-1

do j = 2, n-1

newa(i,j) = w0*a(i,j) + &

w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &

w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))

change = max(change,abs(newa(i,j)-a(i,j)))

enddo

enddo

!$omp do private(i,j)

do i = 2, m-1

do j = 2, n-1

a(i,j) = newa(i,j)

enddo

enddo

enddo

3-36

Jacobi Relaxation

#pragma omp parallel shared(change) private(tmp)

do{

change = 0;

#pragma omp for private(i,j) reduction(max:change)

for( j = 1; j < m-1; ++j ){

for( i = 1; i < n-1; ++i ){

newa[j][i] = w0*a[j][i] +

w1 * (a[j][i-1] + a[j-1][i] +

a[j][i+1] + a[j+1][i]) +

w2 * (a[j-1][i-1] + a[j+1][i-1] +

a[j-1][i+1] + a[j+1][i+1]);

change = fmax(change,fabs(newa[j][i]-a[j][i]));

}

}

tmp = a; a = newa; newa = tmp;

}while( change > tolerance );

3-37

change = tolerance + 1.0

!$acc data create(newa(1:m,1:n)) copy(a(1:m,1:n))

do while(change > tolerance)

change = 0

!$acc kernels reduction(max:change)

do i = 2, m-1

do j = 2, n-1

newa(i,j) = w0*a(i,j) + &

w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &

w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))

change = max(change,abs(newa(i,j)-a(i,j)))

enddo

enddo

do i = 2, m-1

do j = 2, n-1

a(i,j) = newa(i,j)

enddo

enddo

!$acc end kernels

enddo

!$acc end data

3-38

#pragma acc data create(newa[0:n][0:m])\

copy(a[0:n][0:m])

do{

change = 0;

#pragma acc kernels private(i,j) reduction(max:change)

for( j = 1; j < m-1; ++j ){

for( i = 1; i < n-1; ++i ){

newa[j][i] = w0*a[j][i] +

w1 * (a[j][i-1] + a[j-1][i] +

a[j][i+1] + a[j+1][i]) +

w2 * (a[j-1][i-1] + a[j+1][i-1] +

a[j-1][i+1] + a[j+1][i+1]);

change = fmax(change,fabs(newa[j][i]-a[j][i]));

}

}

tmp = a; a = newa; newa = tmp;

}while( change > tolerance );

3-39

Why use Accelerator Directives?

Productivity

Higher level programming model

a la OpenMP

Portability

ignore directives, portable to the host

portable to other accelerators

performance portability

Performance feedback

Downsides

it’s not as easy as inserting a few directives

a good host algorithm is not necessarily a good GPU algorithm

3-40

Basic Syntactic Concepts

Fortran accelerator directive syntax

!$acc directive [clause]...

& continuation

Fortran-77 syntax rules

!$acc or C$acc or *$acc in columns 1-5

continuation with nonblank in column 6

C accelerator directive syntax

#pragma acc directive [clause]... eol

continue to next line with backslash

3-41

Construct

construct is single-entry/single-exit construct

in Fortran, delimited by begin/end directives

in C, a single statement, or {...} region

no jumps into/out of region, no return

compute region contains loops to send to GPU

loop iterations translated to GPU threads

loop indices become threadidx/blockidx indices

data region encloses compute regions

data moved at region boundaries

Construct vs. Region

construct is the lexical instance in the program

region is the dynamic instance when running the program

3-42

Appropriate Algorithm

Nested parallel loops

iterations map to threads

parallelism means threads are independent

nested loops means lots of parallelism

Regular array indexing

allows for stride-1 array fetches

3-43
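A small C illustration of the stride-1 point, with made-up function names: in row-major C the rightmost subscript should be the innermost loop.

/* inner loop over the rightmost subscript: consecutive iterations touch
   consecutive elements, so accesses are stride-1 */
void scale_stride1( int n, int m, float a[n][m], float b[n][m] ){
    for( int j = 0; j < n; ++j )
        for( int i = 0; i < m; ++i )
            b[j][i] = a[j][i] * 2.0f;
}

/* inner loop over the leftmost subscript: consecutive iterations are m
   elements apart, which typically draws the "Non-stride-1 memory accesses"
   feedback message */
void scale_strided( int n, int m, float a[n][m], float b[n][m] ){
    for( int i = 0; i < m; ++i )
        for( int j = 0; j < n; ++j )
            b[j][i] = a[j][i] * 2.0f;
}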

#pragma acc data create(newa[0:n][0:m])\

copy(a[0:n][0:m])

do{

change = 0;

#pragma acc kernels private(i,j) reduction(max:change)

for( j = 1; j < m-1; ++j ){

for( i = 1; i < n-1; ++i ){

newa[j][i] = w0*a[j][i] +

w1 * (a[j][i-1] + a[j-1][i] +

a[j][i+1] + a[j+1][i]) +

w2 * (a[j-1][i-1] + a[j+1][i-1] +

a[j-1][i+1] + a[j+1][i+1]);

change = fmax(change,fabs(newa[j][i]-a[j][i]));

}

}

tmp = a; a = newa; newa = tmp;

}while( change > tolerance );

3-44

#pragma acc data create(newa[0:n][0:m])\

copy(a[0:n][0:m])

do{

change = 0;

#pragma acc parallel private(i,j) reduction(max:change)

{

#pragma acc loop

for( j = 1; j < m-1; ++j ){

for( i = 1; i < n-1; ++i ){

newa[j][i] = w0*a[j][i] +

w1 * (a[j][i-1] + a[j-1][i] +

a[j][i+1] + a[j+1][i]) +

w2 * (a[j-1][i-1] + a[j+1][i-1] +

a[j-1][i+1] + a[j+1][i+1]);

change = fmax(change,fabs(newa[j][i]-a[j][i]));

}

}

}

tmp = a; a = newa; newa = tmp;

}while( change > tolerance );

3-45

Behind the Scenes

compiler determines parallelism (kernels)

or applies parallelism (parallel)

compiler generates thread code

split up the iterations into threads, thread groups

inserts code to use software data cache

accumulate partial sum

second kernel to combine final sum

compiler also inserts data movement

compiler or user determines what data to move

data moved at boundaries of data/compute region

3-46

Behind the Scenes

virtualization penalties

fine grain control

thread scheduling

shared memory usage

loop unrolling

3-47

Building Accelerator Programs

pgfortran -acc a.f90

pgcc -acc a.c

Other options:

-ta=nvidia[,cc1x|cc2x|cc3x]

default in siterc file:

set COMPUTECAP=30;

-ta=nvidia[,cuda5.0]

default in siterc file:

set DEFCUDAVERSION=5.0;

-ta=nvidia,fastmath,nofma

Enable compiler feedback with -Minfo or -Minfo=accel

3-48

Program Execution Model

Host

executes most of the program

allocates accelerator memory

initiates data copy from host memory to accelerator

sends kernel code to accelerator

queues kernels for execution on accelerator

waits for kernel completion

initiates data copy from accelerator to host memory

deallocates accelerator memory

Accelerator

executes kernels, one after another

concurrently, may transfer data between host and accelerator

3-49

Controlling Data Movement

data clauses

data construct and data region

update directive

asynchronous data movement, the wait directive

3-50

Data Clauses

C #pragma acc data copyin(a[0:n]) copyout(r[0:n])

{

....

}

Fortran !$acc data copyin(a(1:n)) copyout(r(1:n))

...

!$acc end data

3-51

Data Clauses

C #pragma acc data copyin(a[0:n][0:m]) copy(r[0:n][0:m])

{ ... }

Fortran !$acc data copyin(a(1:m,1:n)) copy(r(:,:))

...

!$acc end data

3-52

Data Clauses

C #pragma acc data copyin(a[0:n][0:m]) create(r[0:n][0:m])

{ ... }

Fortran !$acc data copyin(a(1:m,1:n)) create(r)

...

!$acc end data

3-53

Data Clauses

C #pragma acc data copyin(a[0:n][0:m]) create(r[1:n-1][1:m-1])

{ ... }

Fortran !$acc data copyin(a(1:m,1:n)) create(r(2:m-1,2:n-1))

...

!$acc end data

3-54

Data Clauses

C #pragma acc data copyin(a[0:n][0:m])

/* data copied to Accelerator here */

{ ... }

/* data copied to Host here */

Fortran !$acc data copyin(a(1:m,1:n))

! data copied to Accelerator here

...

! data copied to Host here

!$acc end data

3-55

Data Clauses

C #pragma acc data copyin(a[0:n][0:m]) present(r[1:n-1][1:m-1])

{ ... }

Fortran !$acc data copyin(a(1:m,1:n)) present(r(2:m-1,2:n-1))

...

!$acc end data

3-56

Data Clauses

C #pragma acc data present_or_copyin(a[0:n]) present(r[1:n-1])

{ ... }

Fortran !$acc data present_or_copyin(a(1:m)) present(r(2:m-1))

...

!$acc end data

3-57

Data Clauses

C #pragma acc data pcopyin(a[0:n]) present(r[1:n-1])

{ ... }

Fortran !$acc data pcopyin(a(1:m)) present(r(2:m-1))

...

!$acc end data

3-58

Data Clauses

copy( list )

copyin( list )

copyout( list )

create( list )

present( list )

present_or_copy( list ) pcopy( list )

present_or_copyin( list ) pcopyin( list )

present_or_copyout( list ) pcopyout( list )

present_or_create( list ) pcreate( list )

deviceptr( list )

3-59

Data Region

C

#pragma acc data data-clauses if( condition )

{

....

}

Fortran

!$acc data data-clauses if( condition )

....

!$acc end data

May be nested and may contain compute regions

May not be nested within a compute region

May contain procedure calls

3-60
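One way to use the if(condition) clause, sketched in C with an arbitrary threshold: small problem sizes skip the data movement and run on the host.

void scale( float *restrict a, float *restrict r, int n ){
    /* stay on the host for small n, where the copies would dominate;
       the 10000 threshold is only an illustrative guess */
    #pragma acc data copyin(a[0:n]) copyout(r[0:n]) if(n > 10000)
    {
        #pragma acc kernels if(n > 10000)
        for( int i = 0; i < n; ++i )
            r[i] = a[i] * 2.0f;
    }
}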

Update Directive

C

#pragma acc update host( list )

#pragma acc update device( list )

Fortran

!$acc update host( list )

!$acc update device( list )

data must be in a data clause for an enclosing data region

implies present( list )

both may be on a single line

update host( list ) device( list )

3-61
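A minimal C sketch of the update directive inside a data region (the loop body and the checkpoint interval are illustrative): the device copy stays resident across iterations, and update host/device move data only when the host needs to see or change it.

void iterate( float *restrict a, int n, int nsteps ){
    #pragma acc data copy(a[0:n])          /* a resides on the device for all steps */
    {
        for( int step = 0; step < nsteps; ++step ){
            #pragma acc kernels
            for( int i = 0; i < n; ++i )
                a[i] = 0.5f * a[i];

            if( step % 100 == 0 ){
                /* bring current values back for host-side inspection or I/O */
                #pragma acc update host(a[0:n])
                /* ... host may read or modify a here ... */
                #pragma acc update device(a[0:n])
            }
        }
    }
}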

Asynchronous Update Directive

C

#pragma acc update host( list ) async(1)

#pragma acc update device( list ) async(2)

Fortran

!$acc update host( list ) async(3)

!$acc update device( list ) async(n)

async value should be >= 0

mapped down to some number of actual async queues

updates with same value will execute in program order

async with no value is same as async(acc_async_noval)

async(acc_async_sync) is same as no async clause

host program may continue

3-62

Wait Directive

C

#pragma acc wait( 2 )

#pragma acc wait

Fortran

!$acc wait( n, n-1 )

!$acc wait

wait with no values waits for ALL asynchronous queues

wait with multiple values waits for each queue

NO implied wait at end of data construct

async updates in a data region require wait

3-63
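Combining the async clause with wait, a brief C sketch (it assumes a and b are already present on the device, e.g. from an enclosing data region):

void checkpoint( float *restrict a, float *restrict b, int n ){
    #pragma acc update host(a[0:n]) async(1)   /* start download on queue 1 */
    #pragma acc update host(b[0:n]) async(2)   /* start download on queue 2 */

    /* ... other host work overlaps with the transfers ... */

    #pragma acc wait(1)   /* a is now valid on the host */
    #pragma acc wait(2)   /* b is now valid on the host */
}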

Data Regions Across Procedures

subroutine sub( a, b )

real :: a(:), b(:)

!$acc kernels copyin(b)

do i = 1,n

a(i) = a(i) * b(i)

enddo

!$acc end kernels

...

end subroutine

subroutine bus(x, y)

real :: x(:), y(:)

!$acc data copy(x)

call sub( x, y )

!$acc end data

3-64

Data Regions Across Procedures

void sub( float* a, float* b, int n ){

int i;

#pragma acc kernels copyin(b[0:n])

for( i = 0; i < n; ++i )

a[i] *= b[i];

...

}

void bus( float* x, float* y, int n ){

#pragma acc data copy(x[0:n])

{

sub( x, y, n );

}

...

3-65
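A related sketch, not on the slide: using the pcopyin/pcopy clauses introduced earlier, the callee works whether or not the caller has wrapped it in a data region.

void sub( float* a, float* b, int n ){
    /* pcopyin/pcopy: reuse the device copies if the caller already put the
       arrays on the device, otherwise copy them just for this region */
    #pragma acc kernels pcopyin(b[0:n]) pcopy(a[0:n])
    for( int i = 0; i < n; ++i )
        a[i] *= b[i];
}

void bus( float* x, float* y, int n ){
    #pragma acc data copy(x[0:n]) copyin(y[0:n])
    {
        sub( x, y, n );   /* no extra transfers: x and y are already present */
    }
}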

Loop Directive

C

#pragma acc loop clause...

for( i = 0; i < n; ++i ){

....

}

Fortran

!$acc loop clause...

do i = 1, n

3-66

Kernels Loop Scheduling Clauses

!$acc loop gang

runs in ‘gang’ mode only (blockIdx)

does not declare that the loop is in fact parallel (use independent)

!$acc loop gang(32)

runs in ‘parallel’ mode only with gridDim == 32 (32 blocks)

!$acc loop vector(128)

runs in ‘vector’ mode (threadIdx) with blockDim == 128 (128

threads)

vector size, if present, must be compile-time constant

!$acc loop gang vector(128)

strip mines loop

inner loop runs in vector mode, 128 threads (threadIdx)

outer loop runs across thread blocks (blockIdx)

3-67
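Applied to a doubly nested loop, a schedule along these lines (the sizes are only examples) puts the outer loop on thread blocks and the stride-1 inner loop on 128 threads per block:

void scale2d( int n, int m, float a[n][m], float b[n][m] ){
    #pragma acc kernels copyin(a[0:n][0:m]) copyout(b[0:n][0:m])
    {
        #pragma acc loop gang              /* outer loop across thread blocks (blockIdx) */
        for( int j = 0; j < n; ++j ){
            #pragma acc loop vector(128)   /* stride-1 inner loop across 128 threads (threadIdx) */
            for( int i = 0; i < m; ++i )
                b[j][i] = 2.0f * a[j][i];
        }
    }
}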

Loop Scheduling Clauses

Want stride-1 loop to be in ‘vector’ mode (threadIdx)

look at -Minfo messages!

Want lots of parallelism

3-68

Loop Directive Clauses

Scheduling Clauses

vector or vector(n)

gang or gang(n)

worker or worker(n)

independent

use with care, overrides compiler analysis for dependence, private

variables

private( list )

private data for each iteration of the loop

different from local (how?)

reduction( red:var )

reduction across the loop

3-69
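A C sketch pulling these clauses together (the loop is contrived): independent asserts the outer iterations do not conflict, t is privatized per iteration, and sum is combined with a reduction.

void rowsum( int n, int m, float a[n][m], float *restrict total ){
    float sum = 0.0f;
    float t;
    #pragma acc kernels copyin(a[0:n][0:m])
    {
        /* independent overrides the compiler's dependence analysis for the j loop;
           each iteration gets its own copy of t; sum is a + reduction */
        #pragma acc loop independent private(t) reduction(+:sum)
        for( int j = 0; j < n; ++j ){
            t = 0.0f;
            for( int i = 0; i < m; ++i )
                t += a[j][i];
            sum += t;
        }
    }
    *total = sum;
}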

Compiler Feedback Messages

Data related

Generating copyin(b(1:n,1:m))

Generating copyout(b(2:n-1,2:m-1))

Generating copy(a(1:n,1:n))

Generating create(c(1:n,1:n))

Loop or kernel related

Loop is parallelizable

Accelerator kernel generated

Barriers to GPU code generation

No parallel kernels found, accelerator region ignored

Loop carried dependence due to exposed use of ... prevents

parallelization

Parallelization would require privatization of array ...

Compiler Messages Continued

Memory optimization related

Cached references to size [(x+2)x(y+2)] block of ‘b’

Non-stride-1 memory accesses for ‘a’

Combined Directives

C #pragma acc kernels loop copyin(a[0:n])\

gang vector(32)

for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;

Fortran

!$acc kernels loop copyin(a(1:n)) &

gang vector(32)

do i = 1,n

r(i) = a(i) * 2.0

enddo

3-72

Parallel Region

C #pragma acc parallel loop copyin(a[0:n])

for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;

Fortran

!$acc parallel loop copyin(a(1:n))

do i = 1,n

r(i) = a(i) * 2.0

enddo

3-73

Selecting Device

C

#include <accel.h>

n = acc_get_num_devices(acc_device_nvidia);

...

acc_set_device_num( 1, acc_device_nvidia);

Fortran

use accel_lib

n = acc_get_num_devices( acc_device_nvidia )

...

call acc_set_device_num( 0, acc_device_nvidia)

Environment Variable

setenv ACC_DEVICE_NUM 1

export ACC_DEVICE_NUM=1

3-74

Performance Profiling

TAU (Tuning and Analysis Utilities, University of Oregon)

collects performance information

cudaprof (NVIDIA)

gives a trace of kernel execution

PGI_ACC_TIME environment variable

dump of region-level and kernel-level performance

upload/download data movement time

kernel execution time

pgcollect a.out

PGI_ACC_PROFILE environment variable

enables profile data collection for accelerator regions

ACC_NOTIFY environment variable

prints one line for each kernel invocation

3-75

Performance Profiling

Accelerator Kernel Timing data

f3.f90

smooth

24: region entered 1 time

time(us): total=1116701 init=1115986 region=715

kernels=22 data=693

w/o init: total=715 max=715 min=715 avg=715

27: kernel launched 5 times

grid: [7x7] block: [16x16]

time(us): total=17 max=10 min=1 avg=3

34: kernel launched 5 times

grid: [7x7] block: [16x16]

time(us): total=5 max=1 min=1 avg=1

3-76

Directives Summary acc kernels

if(condition)

copy(list) copyin(list) copyout(list) create(list)

acc parallel

if(condition)

num_gangs(n) num_workers(n) vector_length(n)

acc data

copy(list) copyin(list) copyout(list) create(list)

if(condition)

acc update device(list) host(list)

acc loop

gang(width) worker(width) vector(width)

independent

private(list)

reduction(red:var)

3-77

Runtime Summary

acc_get_num_devices( acc_device_nvidia )

acc_set_device_num( n, acc_device_nvidia )

acc_set_device( acc_device_nvidia | acc_device_host )

acc_get_device()

acc_init( acc_device_nvidia | acc_device_host )

3-78

Environment Variable Summary

ACC_DEVICE_NUM

ACC_DEVICE

ACC_NOTIFY

PGI_ACC_TIME

3-79

Command Line Summary

-ta=nvidia

-ta=nvidia,nofma

-ta=nvidia,fastmath

-ta=nvidia,[cc1x|cc2x|cc3x] (multiple allowed)

-ta=nvidia,maxregcount:n

-ta=nvidia,cuda5.0

3-80

Accelerator Programming Timeline

3-81

late 1990s – GPGPU Programming

early 2000s – Brook project (and others)

2007 – NVIDIA CUDA release, SC07 tutorial

HiCUDA, HMPP (and others)

EXOCHI: PLDI 2007 (Intel)

2008 – PGI Accelerator Programming Model

Larrabee paper at SIGGRAPH (Intel)

2009 – OpenMP BoF at SC09

Programming Model for Heterogeneous x86: PLDI 2009 (Intel)

Larrabee featured in Justin Rattner (Intel) opening address

2010 – OpenMP Accelerator committee

2011 – OpenACC API 1.0

Language Extensions for Offload (LEO, Intel)

Where to get help

• PGI Customer Support - [email protected]

• PGI User's Forum - http://www.pgroup.com/userforum/index.php

• PGI Articles - http://www.pgroup.com/resources/articles.htm

http://www.pgroup.com/resources/accel.htm

• PGI User's Guide - http://www.pgroup.com/doc/pgiug.pdf

• CUDA Fortran Reference Guide -

http://www.pgroup.com/doc/pgicudafortug.pdf

OpenACC Members

CAPS Entreprise

Cray, Inc.

NVIDIA

The Portland Group, Inc.

OpenACC Supporters

Allinea

CSCS (Switzerland)

Oak Ridge National Lab

Tech. Univ. Dresden

Univ. Houston

Sandia

Georgia Tech

Rogue Wave

3-83

Copyright Notice

© Contents copyright 2009-2012, The Portland Group, Inc. This

material may not be reproduced in any manner without the

expressed written permission of The Portland Group.

3-84